Software Architecture
for Data Applications
Ding Li 2021.6
A Sample Data System
Service Level Objectives (SLOs)
Service Level Agreements (SLAs)
A service may be required to be up at least
99.9% of the time.
A service may be considered up if:
• Median response time < 200 ms
• 99th percentile < 1s
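The two latency thresholds above can be checked against measured response times. This is a minimal sketch: the function name meets_slo and the sample data are hypothetical, and only the 200 ms and 1 s limits come from the slide.

```python
import statistics


def meets_slo(response_times_ms, median_limit_ms=200, p99_limit_ms=1000):
    """Return True if the sample satisfies both latency thresholds."""
    median = statistics.median(response_times_ms)
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    p99 = statistics.quantiles(response_times_ms, n=100)[98]
    return median < median_limit_ms and p99 < p99_limit_ms


# Hypothetical samples: 1,000 requests, without and with a slow tail.
all_fast = [120] * 1000
slow_tail = [120] * 990 + [1500] * 10

print(meets_slo(all_fast))
print(meets_slo(slow_tail))
```

Both samples have the same median, but the second one fails because its 99th percentile falls in the slow tail. This is why the slide lists a tail percentile next to the median.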
Elastic Service:
Automatically add computing resources when
detecting a load increase
Operability: Making life easy for operations
Simplicity: Managing Complexity
Evolvability: Making Change Easy
Data Models
Relational Model (SQL): schema-on-write
Pro: join
Con: locality
Document Model (NoSQL): tree structure, schema-on-read
Pro: locality
Con: join
JSON support in relational databases: PostgreSQL, MySQL, SQL Server, Oracle
Query Languages
Imperative Language
Tells the computer to perform certain operations in a certain order.
Declarative Language (SQL)
You specify only the pattern of the data you want, not how to achieve that goal.
The query optimizer automatically decides which parts of the query to
execute in which order, and which indexes to use.
Declarative languages often lend themselves to parallel execution.
Declarative Queries on the Web
CSS: select all <p> elements whose direct parent is an
<li> element with a CSS class of selected
XSL (XPath):
Imperative Approach in JavaScript
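The contrast between the imperative and declarative styles can be sketched in Python with the standard sqlite3 module. The animals table, its rows, and both function names are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical data set used by both versions.
animals = [("Great white", "Sharks"), ("Lion", "Cats"), ("Hammerhead", "Sharks")]


def sharks_imperative(rows):
    """Imperative: spell out the loop, the test, and the order of operations."""
    result = []
    for name, family in rows:
        if family == "Sharks":
            result.append(name)
    return result


def sharks_declarative(rows):
    """Declarative: state the pattern; the database decides how to execute it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
        conn.executemany("INSERT INTO animals VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT name FROM animals WHERE family = ? ORDER BY name", ("Sharks",)
        )
        return [name for (name,) in cursor]
    finally:
        conn.close()


print(sharks_imperative(animals))
print(sharks_declarative(animals))
```

The SQL version says nothing about loops or indexes, which leaves the database free to change its execution strategy without any change to the query.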
Graph Models & Query Language
Find people who emigrated from the US to Europe:
Cypher Query Language (Neo4j)
Using Recursive Common Table Expressions in SQL:
follow a chain of outgoing WITHIN edges until eventually you reach a vertex
of type Location, whose name property is equal to "United States".
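A recursive common table expression for this query can be sketched with SQLite, which supports WITH RECURSIVE. The vertices and edges tables, their column names, and the sample people are assumptions made for illustration. Note that the sketch starts at the "United States" vertex and collects every location WITHIN it, which finds the same locations as following outgoing WITHIN edges from each place.

```python
import sqlite3

# Hypothetical graph: vertices plus directed edges (tail -> head) with a label.
VERTICES = [
    (1, "Location", "United States"),
    (2, "Location", "Idaho"),
    (3, "Location", "Europe"),
    (4, "Location", "England"),
    (5, "Location", "London"),
    (6, "Location", "France"),
    (10, "Person", "Lucy"),
    (11, "Person", "Alain"),
]
EDGES = [
    (2, 1, "WITHIN"),     # Idaho is within the United States
    (4, 3, "WITHIN"),     # England is within Europe
    (5, 4, "WITHIN"),     # London is within England
    (6, 3, "WITHIN"),     # France is within Europe
    (10, 2, "BORN_IN"),   # Lucy was born in Idaho ...
    (10, 5, "LIVES_IN"),  # ... and lives in London
    (11, 6, "BORN_IN"),   # Alain was born in France ...
    (11, 2, "LIVES_IN"),  # ... and lives in Idaho
]

QUERY = """
WITH RECURSIVE
  in_usa(vertex_id) AS (
    SELECT id FROM vertices WHERE type = 'Location' AND name = 'United States'
    UNION
    SELECT edges.tail FROM edges
      JOIN in_usa ON edges.head = in_usa.vertex_id
      WHERE edges.label = 'WITHIN'
  ),
  in_europe(vertex_id) AS (
    SELECT id FROM vertices WHERE type = 'Location' AND name = 'Europe'
    UNION
    SELECT edges.tail FROM edges
      JOIN in_europe ON edges.head = in_europe.vertex_id
      WHERE edges.label = 'WITHIN'
  )
SELECT vertices.name
FROM vertices
JOIN edges AS born ON born.tail = vertices.id AND born.label = 'BORN_IN'
JOIN in_usa ON born.head = in_usa.vertex_id
JOIN edges AS lives ON lives.tail = vertices.id AND lives.label = 'LIVES_IN'
JOIN in_europe ON lives.head = in_europe.vertex_id
ORDER BY vertices.name
"""


def emigrated_us_to_europe():
    """Return names of people born somewhere in the US who live somewhere in Europe."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE vertices (id INTEGER PRIMARY KEY, type TEXT, name TEXT)")
        conn.execute("CREATE TABLE edges (tail INTEGER, head INTEGER, label TEXT)")
        conn.executemany("INSERT INTO vertices VALUES (?, ?, ?)", VERTICES)
        conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", EDGES)
        return [name for (name,) in conn.execute(QUERY)]
    finally:
        conn.close()


print(emigrated_us_to_europe())
```

The recursion is needed because the number of WITHIN hops is not known in advance: London reaches Europe in two hops, France in one.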
Data Warehouse and Star Schema
OLTP: Online Transaction Processing
OLAP: Online Analytic Processing
Column-Oriented Storage
Although fact tables are often over 100 columns wide, a typical data
warehouse query only accesses a few of them at one time
Column Compression (effective when a column contains many repeated values)
Examples: Amazon Redshift, Apache Parquet
If a fact table has 10M rows but only 100 distinct products,
the product column achieves a good compression ratio.
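Why a column with few distinct values compresses well can be sketched with dictionary encoding followed by run-length encoding. This is one common combination, chosen here for illustration; it is not a description of how any particular product stores its columns, and the function names and sample column are hypothetical.

```python
def dictionary_encode(column):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in column]


def run_length_encode(codes):
    """Collapse consecutive repeats into [code, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


# Hypothetical product column of a tiny fact table.
product_column = ["apple"] * 4 + ["banana"] * 3 + ["apple"] * 2
dictionary, codes = dictionary_encode(product_column)
runs = run_length_encode(codes)

print(dictionary)
print(runs)
```

Nine strings shrink to a two-entry dictionary and three runs. The runs get longer, and the saving larger, when the rows are sorted by that column.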
Partitioning (Sharding)
Combining Replication and Partitioning
Partitioning by Key Range
If using (update_time) as Key:
• Range scan is easy
• Can lead to hot spots
If using (user_id, update_time) as key:
• Different users may be stored on different partitions
• Within each user, the updates are stored ordered by
timestamp on a single partition
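The two key choices above can be compared with a small key-range partitioner. This is a sketch under stated assumptions: the class name, the split points, and the sample writes are all hypothetical, and timestamps are plain integers.

```python
import bisect


class RangePartitioner:
    """Assign keys to partitions using sorted split points (key-range partitioning)."""

    def __init__(self, split_points):
        self.split_points = sorted(split_points)

    def partition_for(self, key):
        # Partition i holds keys from split_points[i-1] up to, but not including, split_points[i].
        return bisect.bisect_right(self.split_points, key)


# Key = update_time only: every new write carries a recent timestamp,
# so all current traffic lands on the last partition (a hot spot).
by_time = RangePartitioner([1000, 2000])
recent_writes = [2001, 2002, 2003]
time_partitions = [by_time.partition_for(t) for t in recent_writes]

# Key = (user_id, update_time): users spread across partitions, and one
# user's updates stay together on a single partition, ordered by timestamp.
by_user_then_time = RangePartitioner([("h",), ("p",)])
writes = [("alice", 2001), ("mallory", 2002), ("zed", 2003), ("alice", 2004)]
user_partitions = [by_user_then_time.partition_for(key) for key in writes]

print(time_partitions)
print(user_partitions)
```

With the timestamp key, all three recent writes map to the same partition. With the compound key, the three users map to three different partitions while both of alice's updates stay together.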
Request Routing
ZooKeeper: keeps track of cluster metadata (used by Kafka, SolrCloud, HBase)
Transactions and ACID
Transaction: group several reads and writes together into a logical unit.
All the reads and writes in a transaction are executed as one operation:
either the entire transaction succeeds (commit) or it fails (abort, rollback).
To be reliable, a system must deal with various faults and ensure
that they don’t cause catastrophic failure of the entire system.
Atomicity:
If the transaction cannot be completed (committed) due to a fault,
the database must discard or undo any writes it has made so far.
Consistency:
Certain statements about your data (invariants) must always be
true. For example, in an accounting system, credits and debits across
all accounts must always be balanced.
Isolation:
Concurrently executing transactions are isolated from each other:
they cannot step on each other’s toes.
Durability:
Once a transaction has committed successfully, any data it has
written will not be forgotten, even if there is a hardware fault
or the database crashes.
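Atomicity and the accounting invariant can be illustrated with SQLite transactions. This is a minimal sketch: the accounts table, the starting balances, and the helper names are hypothetical.

```python
import sqlite3


def open_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
    conn.commit()
    return conn


def balances(conn):
    return [b for (b,) in conn.execute("SELECT balance FROM accounts ORDER BY id")]


def transfer(conn, src, dst, amount):
    """Debit and credit as one transaction: both writes commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            (remaining,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if remaining < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


conn = open_bank()
print(transfer(conn, 1, 2, 500), balances(conn))
print(transfer(conn, 1, 2, 60), balances(conn))
conn.close()
```

The failed transfer leaves no trace even though both UPDATE statements had already run, and the total across accounts stays at 100 in every case.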
Serializable isolation has a performance cost, so many systems use
weaker levels of isolation. Read Committed is the most basic one.
Dirty Read: reading another transaction’s uncommitted write
Example: user 2 sees an unread message in the mailbox listing, but the counter shows zero unread messages
Dirty Write: overwriting an uncommitted value
Example: the sale’s listing is assigned to “Bob”, but the invoice recipient is “Alice”
Read Committed & Repeatable Read
No Dirty Read: only committed data can be read
user 2: sees the new value for x only after user 1’s transaction has committed
No Dirty Write: only committed data can be overwritten
Read Committed
Nonrepeatable Read Issue:
Alice sees a total of $900 in her accounts instead of $1,000.
Repeatable Read (Snapshot Isolation)
Each transaction reads from a consistent snapshot of the database—
that is, the transaction sees all the data that was committed in the
database at the start of the transaction.
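The difference between Read Committed and snapshot reads can be sketched with a toy multi-version store. Everything here is hypothetical (class name, account names, balances of $500 each, a $100 transfer); it shows the idea of keeping several committed versions per key, not how any particular database implements it.

```python
class VersionedStore:
    """Toy multi-version store: every commit appends new versions instead of overwriting."""

    def __init__(self):
        self.commit_seq = 0
        self.versions = {}  # key -> list of (commit_seq, value), oldest first

    def commit(self, writes):
        self.commit_seq += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.commit_seq, value))

    def read_latest(self, key):
        """Read Committed: return the newest committed value."""
        return self.versions[key][-1][1]

    def read_as_of(self, key, snapshot_seq):
        """Snapshot read: return the newest value committed at or before the snapshot."""
        visible = [value for seq, value in self.versions[key] if seq <= snapshot_seq]
        return visible[-1]


def alice_total(use_snapshot):
    store = VersionedStore()
    store.commit({"account1": 500, "account2": 500})
    snapshot = store.commit_seq  # Alice's transaction starts here

    def read(key):
        if use_snapshot:
            return store.read_as_of(key, snapshot)
        return store.read_latest(key)

    first = read("account1")
    # A concurrent transfer of $100 from account 2 to account 1 commits between Alice's reads.
    store.commit({"account1": 600, "account2": 400})
    second = read("account2")
    return first + second


print(alice_total(use_snapshot=False))
print(alice_total(use_snapshot=True))
```

Without the snapshot, Alice reads one balance from before the transfer and one from after it, so $100 appears to be missing. With the snapshot, both reads see the state as of the start of her transaction.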
Distributed System: Lock and Fencing Token
Lock and Issue:
Only one transaction or client is allowed to hold the lock
for a particular resource or object, to prevent concurrent
writes from corrupting it.
Issue: client 1 believes that it still has a valid lease, even
though it has expired, and thus corrupts a file in storage.
Fencing Token (a number that increases every time a lock is granted):
Every time a client sends a write request to the storage
service, it must include its current fencing token.
Client 1 comes back to life and sends its write to the
storage service, including its token value 33. However,
the storage server remembers that it has already
processed a write with a higher token number (34), and
so it rejects the request with token 33.
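The token check on the storage side can be sketched as follows. The class names and the file key are hypothetical; the token values 33 and 34 follow the example above.

```python
class LockService:
    """Hands out a fencing token that increases every time the lock is granted."""

    def __init__(self, first_token=33):
        self._next_token = first_token

    def acquire(self):
        token = self._next_token
        self._next_token += 1
        return token


class StorageService:
    """Rejects any write whose token is lower than the highest token already seen."""

    def __init__(self):
        self.highest_token = -1
        self.files = {}

    def write(self, key, value, token):
        if token < self.highest_token:
            return False  # stale lease holder: refuse the write
        self.highest_token = token
        self.files[key] = value
        return True


locks = LockService()
storage = StorageService()

token_1 = locks.acquire()  # client 1 gets token 33, then pauses (e.g. a long GC)
token_2 = locks.acquire()  # the lease expires and client 2 gets token 34

accepted_2 = storage.write("report.txt", "written by client 2", token_2)
accepted_1 = storage.write("report.txt", "written by client 1", token_1)

print(accepted_2, accepted_1, storage.files["report.txt"])
```

The check lives in the storage service rather than in the clients, because a paused client cannot be trusted to notice that its own lease has expired.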
Packets can be lost, reordered, duplicated, or arbitrarily delayed in
the network; clocks are approximate at best; and nodes can pause
(e.g., due to garbage collection) or crash at any time.
Batch Processing with Unix Tools, MapReduce, and Spark
• Services [HTTP/REST-based APIs] (online systems)
• Batch Processing Systems (offline systems)
• Stream Processing Systems (near-real-time systems)
Batch Processing Web Logs with Unix Tools
• Read the log file
• Split each line and take the requested URL (the $7 token)
• Sort alphabetically by URL
• Count the occurrences of each distinct URL
• Sort by count in reverse order
• Output the first 5 lines
MapReduce and Distributed Filesystems
• Read a set of input files and break them up into records.
• Call the mapper function to extract a key and value from each input record.
awk '{print $7}': it extracts the URL ($7) as the key and leaves the value empty.
• Sort all the key-value pairs by key.
This is done by the first sort command.
• Call the reducer function to iterate over the sorted key-value pairs.
uniq -c, which counts the number of adjacent records with the same key.
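The log-analysis steps and their map, sort, and reduce structure can be mirrored in a few lines of Python. This is a sketch: the function name top_urls and the sample log lines are hypothetical, and the log format is assumed to put the requested URL in the seventh whitespace-separated field, as in the awk step.

```python
from itertools import groupby


def top_urls(log_lines, n=5):
    """Return the n most requested URLs as (count, url) pairs."""
    # Map: like awk '{print $7}', take the 7th whitespace-separated field as the key.
    urls = []
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 7:
            urls.append(fields[6])
    # Sort: like the first sort, bring equal keys next to each other.
    urls.sort()
    # Reduce: like uniq -c, count adjacent records with the same key.
    counts = [(len(list(group)), url) for url, group in groupby(urls)]
    # Sort by count in reverse and keep the first n lines.
    counts.sort(key=lambda pair: pair[0], reverse=True)
    return counts[:n]


def sample_line(url):
    # Hypothetical access-log line; only the position of the URL matters here.
    return f'10.0.0.1 - - [27/Feb/2015:17:55:11 +0000] "GET {url} HTTP/1.1" 200 512'


log = [sample_line(u) for u in ["/a", "/b", "/a", "/c", "/a", "/c"]]
print(top_urls(log))
```

As with uniq -c, the counting step only works because the sort has already placed identical URLs next to each other.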
Downsides of MapReduce
• A MapReduce job can only start when all tasks in the preceding jobs (that generate its
inputs) have completed
• Mappers are often redundant: they just read back the same file that was just written
by a reducer and prepare it for the next stage of partitioning and sorting.
• Storing intermediate state in a distributed filesystem means those files are replicated
across several nodes, which is often overkill for such temporary data.
Solution: Dataflow Engines (such as Spark)
• Explicitly model the flow of data through several processing stages
• Parallelize work by partitioning inputs, and copy the output of one function over
the network to become the input to another function
• All joins and data dependencies in a workflow are explicitly declared, so the scheduler
has an overview of what data is required where and can make locality optimizations
Stream Processing
Load Balancing (to one consumer) & Fan-out (to all consumers)
Acknowledgments and Redelivery
Partition (can be on different machines) & Topic (same type of messages)
Uses of Stream Processing
• Complex Event Processing (CEP)
Describe the patterns of events that should be detected.
• Stream Analytics
Aggregations and statistical metrics of events (rate, rolling average)
• Search on Streams
Search for individual events (mentioning of companies, products, topics)
Keeping Systems in Sync & Stream Joins
Change Data Capture (CDC)
Dual Writes Issue
Log Compaction: the storage engine periodically looks for log records with the same
key, throws away any duplicates, and keeps only the most recent update for each key
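Log compaction as described above can be sketched in a few lines. The function name and the sample log are hypothetical; records are (key, value) pairs in write order.

```python
def compact(log):
    """Keep only the most recent record for each key, preserving log order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier offsets
    return [
        record
        for offset, record in enumerate(log)
        if latest_offset[record[0]] == offset
    ]


# Hypothetical change log: k1 and k2 are each updated twice.
change_log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", "e")]
print(compact(change_log))
```

Replaying the compacted log still rebuilds the latest value for every key, which is what lets a new consumer bootstrap from the log instead of from a separate snapshot.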
Stream Joins
Stream-Stream Joins (Window Join)
Join the click events with the corresponding on-site search events to compute the click-through rate (CTR).
Events are matched by session ID within a time window.
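The matching logic of such a window join can be sketched over two finite lists. This is a simplification: a real stream processor would maintain this state incrementally and expire it once the window has passed. The function name, the one-hour window, and the sample events are assumptions.

```python
def click_through_rate(searches, clicks, window_seconds=3600):
    """Fraction of searches followed by a click in the same session within the window.

    searches and clicks are lists of (session_id, timestamp_seconds).
    """
    clicks_by_session = {}
    for session_id, click_ts in clicks:
        clicks_by_session.setdefault(session_id, []).append(click_ts)

    clicked = 0
    for session_id, search_ts in searches:
        session_clicks = clicks_by_session.get(session_id, [])
        if any(0 <= click_ts - search_ts <= window_seconds for click_ts in session_clicks):
            clicked += 1
    return clicked / len(searches) if searches else 0.0


# Hypothetical events: s1 is clicked quickly, s2 never, s3 only after the window.
searches = [("s1", 0), ("s2", 10), ("s3", 20)]
clicks = [("s1", 30), ("s3", 5000)]
print(click_through_rate(searches, clicks))
```

Searches without a matching click must still be counted, otherwise the rate cannot be computed. This is why both streams have to be kept for the length of the window.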
Stream-Table Joins (Stream Enrichment)
Join user activity stream events and user profiles in database.
A copy of the database can be loaded into the stream processor.
Table-Table Joins (Materialized View Maintenance)
Twitter timeline cache
• When user u sends a new tweet, it is added to the timeline of every user who is following u.
• When a user deletes a tweet, it is removed from all users’ timelines.
• When user u1 starts following user u2, recent tweets by u2 are added to u1’s timeline.
• When user u1 unfollows user u2, tweets by u2 are removed from u1’s timeline.
Microbatching: break the stream into small blocks and treat each block like a miniature batch process.
The batch size is typically around one second.
Apache Spark: A Unified Engine for Large-scale Distributed Data Processing
The central thrust of the Spark project was to bring in ideas borrowed from Hadoop MapReduce, but to enhance the system:
make it highly fault tolerant and parallel, support in-memory storage for intermediate results between iterative and
interactive map and reduce computations, offer easy and composable APIs in multiple languages as a programming model,
and support other workloads in a unified manner.
• Take advantage of efficient multithreading and parallel processing
• Directed Acyclic Graph (DAG): efficient computational graph executed across workers
on the cluster
• Generate compact code for execution
Ease of Use
• Resilient Distributed Dataset (RDD)
• a fundamental abstraction of a simple logical data structure
• upon which DataFrames and Datasets are constructed
• Components: Spark SQL, Spark Structured Streaming, Spark MLlib, GraphX
• API: Scala, SQL, Python, Java, R
• Decouple the storage and compute
• Spark’s DataFrameReaders and DataFrameWriters can also be extended to
read data from other sources, such as Apache Kafka, Kinesis, Azure Storage,
and Amazon S3, into its logical data abstraction, on which it can operate.
Distributed Data and Partitions
A distributed scheme of breaking up data into chunks or partitions allows Spark executors to process
only data that is close to them, minimizing network bandwidth.
That is, each executor’s core is assigned its own data partition to work on.
Apache Spark’s Distributed Execution
Spark driver: requests resources (CPU, memory, etc.) from the cluster manager for Spark’s executors (JVMs);
transforms all the Spark operations into DAG computations, schedules them, and distributes their
execution as tasks across the Spark executors.
Spark Job/Stage/Task, Transformation/Action
The Spark driver converts a Spark application into one or more Spark jobs.
It then transforms each job into a DAG, Spark’s execution plan.
As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel.
Each task maps to a single core and works on a single partition of data.
Transformations, Actions, and Lazy Evaluation
• All transformations are evaluated lazily.
• Lazy evaluation lets Spark optimize transformations into stages for more efficient execution.
• Nothing in a query plan is executed until an action is invoked.
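The idea of lazy transformations can be illustrated in plain Python. This is an analogy only, not Spark code: the LazyDataset class and its methods are hypothetical and merely mimic the pattern of recording a plan and running it when an action is called.

```python
class LazyDataset:
    """Records transformations as a plan; runs them only when collect() is called."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan

    def map(self, fn):
        # Transformation: extend the plan, touch no data.
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def filter(self, predicate):
        # Transformation: extend the plan, touch no data.
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def collect(self):
        # Action: execute the whole recorded plan in a single pass over the rows.
        output = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                output.append(row)
        return output


seen = []


def times_ten(x):
    seen.append(x)  # record when the function actually runs
    return x * 10


dataset = LazyDataset([1, 2, 3, 4]).map(times_ten).filter(lambda x: x > 20)
ran_before_action = list(seen)  # still empty: nothing has executed yet
result = dataset.collect()
print(ran_before_action, result)
```

Because the full plan is known before anything runs, an engine has the chance to reorder or combine steps, which is the optimization opportunity the bullets above refer to.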
Narrow and Wide Transformations
Narrow: a single output partition can be computed from a single input partition.
Wide: data from other partitions is read in, combined, and written to disk.
Sample Code: Counting M&Ms for the Cookie Monster
A Spark program that reads a file with over 100,000 entries (where each row
or line has a <state, mnm_color, count>) and computes and aggregates the
counts for each color and state.
The Spark program distributes the task across partitions and returns the final
aggregated count.
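The shape of this computation, partial counts per partition followed by a final merge, can be sketched in plain Python. This is an analogy and not the Spark program from the slide: the function names, the two partitions, and the sample rows are hypothetical.

```python
from collections import Counter


def count_partition(rows):
    """Aggregate counts per (state, color) within a single partition."""
    counts = Counter()
    for state, color, count in rows:
        counts[(state, color)] += count
    return counts


def aggregate(partitions):
    """Merge the per-partition results into the final aggregated counts."""
    total = Counter()
    for partial in map(count_partition, partitions):
        total.update(partial)  # Counter.update adds counts key by key
    return total


# Hypothetical <state, mnm_color, count> rows split across two partitions.
partitions = [
    [("CA", "Red", 20), ("TX", "Blue", 10), ("CA", "Red", 5)],
    [("CA", "Red", 7), ("TX", "Blue", 3), ("WA", "Green", 4)],
]
totals = aggregate(partitions)
print(totals[("CA", "Red")], totals[("TX", "Blue")], totals[("WA", "Green")])
```

Each partition can be counted independently, so only the small per-partition summaries need to be combined at the end.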
Apache Spark’s Structured APIs
The RDD is the most basic abstraction in Spark
• Dependencies
instructs Spark how an RDD is constructed with its inputs
• Partitions (with some locality information)
ability to parallelize computation on partitions across executors
• Compute function: Partition => Iterator[T]
produces an Iterator[T] for the data that will be stored in the RDD
Basic Python data types in Spark
Python structured data types in Spark
Schemas and Creating DataFrames
A schema defines the column names and associated data types for a DataFrame.
schema = "author STRING, title STRING, pages INT"
blogs_df = spark.createDataFrame(data, schema)
Spark SQL and the Underlying Engine
• Unifies Spark components and permits abstraction to DataFrames/Datasets in Java,
Scala, Python, and R, which simplifies working with structured data sets.
• Connects to the Apache Hive metastore and tables.
• Reads and writes structured data with a specific schema from structured file formats
(JSON, CSV, Text, Avro, Parquet, ORC, etc.) and converts data into temporary tables.
• Offers an interactive Spark SQL shell for quick data exploration.
• Provides a bridge to (and from) external tools via standard database JDBC/ODBC connectors.
• Generates optimized query plans and compact code for the JVM, for final execution.
Spark SQL and DataFrames
Create SQL Database
Create Table
Create View
Views disappear after Spark application terminates
Global: visible across all SparkSessions on a given cluster
Session-scoped: visible only to a single SparkSession
Viewing the Metadata
Reading Tables into DataFrames
Reading from Database using JDBC
DataFrameReader and DataFrameWriter
Higher-Order Functions in Spark SQL and DataFrames
Union two different DataFrames with the same schema together
Structured Streaming
Data stream as an unbounded table
Append mode: only append new rows
Update mode: update rows changed since last trigger
Complete mode: write entire result table
Structured Streaming Query
Define input source
Transform data
Define output sink & mode
Specify processing details
Triggering details
• Default: next micro-batch is triggered as soon as the previous one has completed
• Processing time with trigger interval: will trigger micro-batches at fixed interval
• Once: runs exactly one micro-batch and then stops; an external scheduler can restart the query on a custom schedule
• Continuous: will process data continuously instead of in micro-batches, experimental
Streaming Data Sources and Sinks
Each file must appear in the directory listing atomically. That is, the whole file
must be available at once for reading, and once it is available, the file cannot be
updated or modified.
Structured Streaming will process the file when the engine finds it (using directory listing)
and internally mark it as processed. Any changes to that file will not be processed.
Apache Kafka
Fields of returned DataFrame: key, value, topic, partition, offset, timestamp, timestampType
Stateful Streaming Aggregations
Stream-Stream Join
Aggregations with Event-Time Windows
The watermark defines how long the engine will wait for late data to arrive.
With these time constraints for each event, the processing engine can automatically
calculate how long events need to be buffered to generate correct results, and
when the events can be dropped from the state.
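How a watermark bounds the buffered state can be sketched with a toy event-time window counter. This is a simplified model under stated assumptions: tumbling windows, integer event times in seconds, and a hypothetical WindowedCounter class. The exact rules Spark applies for finalizing windows and dropping late events differ in detail.

```python
class WindowedCounter:
    """Counts events per tumbling event-time window, dropping data older than the watermark."""

    def __init__(self, window_seconds, watermark_delay_seconds):
        self.window = window_seconds
        self.delay = watermark_delay_seconds
        self.max_event_time = 0
        self.open_windows = {}  # window start -> count, still buffered in state
        self.finalized = {}     # window start -> final count, removed from state

    def _watermark(self):
        # The engine waits `delay` seconds behind the latest event time it has seen.
        return self.max_event_time - self.delay

    def process(self, event_time):
        """Return True if the event was counted, False if it arrived too late."""
        start = event_time - event_time % self.window
        if start + self.window <= self._watermark():
            return False  # its window is already behind the watermark
        self.open_windows[start] = self.open_windows.get(start, 0) + 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize every window that now lies entirely behind the watermark.
        closed = [s for s in self.open_windows if s + self.window <= self._watermark()]
        for s in closed:
            self.finalized[s] = self.open_windows.pop(s)
        return True


counter = WindowedCounter(window_seconds=10, watermark_delay_seconds=5)
outcomes = [counter.process(t) for t in [1, 12, 8, 16, 9]]
print(outcomes, counter.finalized, counter.open_windows)
```

The event at time 8 arrives late but within the allowed delay, so it is still counted. The event at time 9 arrives after its window has passed the watermark and is dropped, which is what allows the engine to discard that window's state.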
Machine Learning Pipeline
Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
Joblib can be used for hyperparameter tuning as it automatically broadcasts a copy of your
data to all your workers, which then create their own models with different hyperparameters
on their copies of the data.
Pandas is a very popular data analysis and manipulation library in Python, but it is limited to
running on a single machine. Koalas is an open-source library that implements the Pandas
DataFrame API on top of Apache Spark, easing the transition from Pandas to Spark.
AWS SageMaker
Azure ML
Further Reading for Software Architecture

Software architecture for data applications

  • 1. Software Architecture for Data Applications Ding Li 2021.6
  • 2. 2
  • 3. 3 A Sample Data System Memcached Redis ElastiCache Elasticsearch Solr Service Level Objectives (SLOs) Service Level Agreements (SLAs) A service may be required to be up at least 99.9% of the time. A service may be considered up if: • Median response time < 200 ms • 99th percentile < 1s Elastic Service: Automatically add computing resources when detecting a load increase Operability: Making life easy for operations Simplicity: Managing Complexity Evolvability: Making Change Easy Kinesis Kafka RabbitMQ
  • 4. 4 Data Models Relational Model (SQL) Document Model (NoSQL) Tree Structure Pro: join Con: locality Schema-on-write Pro: locality Con: join Schema-on-read Json support in relational databases: PostgreSQL, MySQL, SQL, Oracle
  • 5. 5 Query Languages Imperative Language tells the computer to perform certain operations in a certain order Declarative Language (SQL) just specify the pattern of the data you want, but not how to achieve that goal. (often lend themselves to parallel execution) Declarative Queries on the Web the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use PostgreSQL CSS: all <p> elements whose direct parent is an <li> element with a CSS class of selected XSL (XPath): Imperative Approach in JavaScript
  • 6. Graph Models & Query Language
      Example: find people who emigrated from the US to Europe.
      • Cypher Query Language (Neo4j)
      • Recursive common table expressions in SQL: follow a chain of outgoing WITHIN edges until you eventually reach a vertex of type Location whose name property is equal to "United States".
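A recursive common table expression can be tried out with SQLite, which ships with Python. The sketch below covers only the "which locations are within the United States" part of the example; the within_edges table, its rows, and the in_usa name are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical edge table: each row says that `child` is WITHIN `parent`.
conn.execute("CREATE TABLE within_edges (child TEXT, parent TEXT)")
conn.executemany(
    "INSERT INTO within_edges VALUES (?, ?)",
    [
        ("Idaho", "United States"),
        ("Boise", "Idaho"),
        ("England", "United Kingdom"),
        ("London", "England"),
    ],
)

# Start from the target vertex and repeatedly add every location that has
# a WITHIN edge pointing at something already in the set.
query = """
WITH RECURSIVE in_usa(name) AS (
    SELECT 'United States'
    UNION
    SELECT e.child
    FROM within_edges AS e
    JOIN in_usa ON e.parent = in_usa.name
)
SELECT name FROM in_usa ORDER BY name
"""
locations_in_usa = [row[0] for row in conn.execute(query)]
conn.close()
```

The recursion handles chains of any length, which a fixed number of joins cannot.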
  • 7. Data Warehouse and Star Schema
      OLTP: Online Transaction Processing
      OLAP: Online Analytic Processing
  • 8. Column-Oriented Storage
      Although fact tables are often over 100 columns wide, a typical data warehouse query accesses only a few of them at a time.
      Column compression (effective when a column has few distinct values): if a fact table has 10M rows but only 100 products, the product column achieves a good compression ratio.
      Examples: Redshift, Parquet
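The compression argument can be illustrated with plain run-length encoding on a sorted column. This is a simplified stand-in chosen for brevity, not the specific encoding used by Redshift or Parquet, and the column below is synthetic (10,000 rows instead of 10M).

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Synthetic product_id column: 10,000 rows but only 100 distinct products, sorted.
product_column = sorted(i % 100 for i in range(10_000))
runs = run_length_encode(product_column)
```

Here 10,000 values shrink to 100 pairs; the fewer distinct values a column has relative to its row count, the better this works.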
  • 9. Partitioning (Sharding)
      Combining replication and partitioning
      Partitioning by key range
      If using (update_time) as the key:
      • Range scans are easy
      • Can lead to hot spots
      If using (user_id, update_time) as the key:
      • Different users may be stored on different partitions
      • Within each user, the updates are stored ordered by timestamp on a single partition
      Request routing
      • ZooKeeper keeps track of cluster metadata (used by Kafka, SolrCloud, HBase)
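Key-range partitioning with a compound (user_id, update_time) key can be sketched with a sorted list of split points. The boundary values, the zero-padded user_id format, and the function name are assumptions for illustration only.

```python
from bisect import bisect_right

# Hypothetical split points: partition 0 holds user_ids below "user_300",
# partition 1 holds "user_300" up to (not including) "user_600", partition 2 the rest.
BOUNDARIES = ["user_300", "user_600"]


def partition_for(key):
    """Route a (user_id, update_time) key by its user_id prefix only."""
    user_id, _update_time = key
    return bisect_right(BOUNDARIES, user_id)


updates = [("user_120", 17), ("user_750", 3), ("user_120", 5), ("user_300", 9)]
partitions = {}
for key in updates:
    partitions.setdefault(partition_for(key), []).append(key)

# Within a partition, keys are kept sorted, so one user's updates are ordered by time.
for keys in partitions.values():
    keys.sort()
```

Because routing looks only at user_id, writes spread across partitions by user instead of all landing on the partition that holds the newest timestamps.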
  • 10. Transaction and ACID
      Transaction: groups several reads and writes together into a logical unit. All the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (commit) or it fails (abort, rollback).
      To be reliable, a system must deal with various faults and ensure that they don’t cause catastrophic failure of the entire system.
      • Atomicity: if the transaction cannot be completed (committed) due to a fault, the database must discard or undo any writes it has made so far.
      • Consistency: certain statements about your data (invariants) must always be true. For example, in an accounting system, credits and debits across all accounts must always be balanced.
      • Isolation: concurrently executing transactions are isolated from each other: they cannot step on each other’s toes.
      • Durability: the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is a hardware fault or the database crashes.
      Because serializable isolation has a performance cost, many systems use weaker levels of isolation; Read Committed is a basic one.
      • Dirty read: reading another transaction’s uncommitted writes. Example: for user 2, the mailbox listing shows an unread message, but the counter shows zero unread messages.
      • Dirty write: overwriting an uncommitted value. Example: the sale’s listing is assigned to “Bob”, but the recipient is set to “Alice”.
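Atomicity is easy to observe with SQLite from Python's standard library. In this sketch the accounts table, the CHECK constraint that plays the role of the fault, and the transfer function are all illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)           # succeeds and commits
overdraft = transfer(conn, "alice", "bob", 500)   # second update fails, first is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the failed transfer, the credit to bob had already been written when the debit violated the constraint; the rollback discards it, so no money appears from nowhere.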
  • 11. Read Committed & Repeatable Read
      Read Committed
      • No dirty reads: only committed data can be read. Example: user 2 sees the new value for x only after user 1’s transaction has committed.
      • No dirty writes: only committed data can be overwritten.
      Nonrepeatable read issue: Alice sees a total of $900 in her accounts instead of $1000.
      Repeatable Read (Snapshot Isolation): each transaction reads from a consistent snapshot of the database; that is, the transaction sees all the data that was committed in the database at the start of the transaction.
  • 12. Distributed System: Lock and Fencing Token
      Packets can be lost, reordered, duplicated, or arbitrarily delayed in the network; clocks are approximate at best; and nodes can pause (e.g., due to garbage collection) or crash at any time.
      Lock: only one transaction or client is allowed to hold the lock for a particular resource or object, to prevent concurrent writes from corrupting it.
      Issue: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage.
      Fencing token: a number that increases every time a lock is granted.
      • Every time a client sends a write request to the storage service, it must include its current fencing token.
      • Client 1 comes back to life and sends its write to the storage service, including its token value 33. However, the storage server remembers that it has already processed a write with a higher token number (34), so it rejects the request with token 33.
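The token check on the storage side is small enough to sketch directly. The class and method names below are hypothetical; the token values 33 and 34 follow the example on the slide.

```python
class StorageService:
    """Toy storage server that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # A token lower than one already seen means the sender's lease has
        # been superseded, so the write is refused.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


storage = StorageService()
accepted_new = storage.write(34, "file", "written by client 2")
accepted_stale = storage.write(33, "file", "written by client 1")
```

The protection does not rely on client 1 noticing that its lease expired; the server enforces the ordering on its own.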
  • 13. Batch Processing with Unix Tools, MapReduce, and Spark
      Three types of systems:
      • Services [HTTP/REST-based APIs] (online systems)
      • Batch processing systems (offline systems)
      • Stream processing systems (near-real-time systems)
      Batch processing web logs with Unix tools:
      • Read the log file
      • Split each line and take the URL (the $7 token)
      • Sort alphabetically by URL
      • Count the occurrences of each distinct URL
      • Sort by count in reverse order
      • Output the first 5 lines
      MapReduce and distributed filesystems:
      • Read a set of input files and break them up into records.
      • Call the mapper function to extract a key and value from each input record. In the Unix example the mapper is awk '{print $7}': it extracts the URL ($7) as the key and leaves the value empty.
      • Sort all the key-value pairs by key. In the Unix example this is done by the first sort command.
      • Call the reducer function to iterate over the sorted key-value pairs. In the Unix example the reducer is uniq -c, which counts the number of adjacent records with the same key.
      Downsides of MapReduce:
      • A MapReduce job can only start when all tasks in the preceding jobs (that generate its inputs) have completed.
      • Mappers are often redundant: they just read back the same file that was just written by a reducer and prepare it for the next stage of partitioning and sorting.
      • Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is often overkill for such temporary data.
      Solution: dataflow engines (such as Spark)
      • Explicitly model the flow of data through several processing stages.
      • Parallelize work by partitioning inputs, and copy the output of one function over the network to become the input to another function.
      • Because all joins and data dependencies in a workflow are explicitly declared, the scheduler has an overview of what data is required where, so it can make locality optimizations.
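The map, sort, and reduce steps described for the web-log example can be mimicked in a few lines of single-process Python. This is only a sketch of the data flow: the log lines are made up, and the function names are assumptions.

```python
from itertools import groupby


def mapper(line):
    """Like awk '{print $7}': the 7th whitespace-separated field is the key, the value is empty."""
    return (line.split()[6], "")


def reducer(key, values):
    """Like uniq -c: count the adjacent records that share a key."""
    return (key, sum(1 for _ in values))


def top_urls(lines, n=5):
    # Map, then sort by key so equal keys become adjacent (the first `sort`).
    pairs = sorted(mapper(line) for line in lines if len(line.split()) >= 7)
    # Reduce each run of equal keys.
    counts = [
        reducer(key, (value for _, value in group))
        for key, group in groupby(pairs, key=lambda pair: pair[0])
    ]
    # Sort by count in reverse and keep the first n (sort -r -n | head).
    return sorted(counts, key=lambda pair: pair[1], reverse=True)[:n]


def make_line(url):
    """Build a hypothetical access-log line whose 7th field is the URL."""
    return f'203.0.113.7 - - [27/Feb/2015:17:55:11 +0000] "GET {url} HTTP/1.1" 200 512'


log = [make_line(url) for url in ["/a", "/b", "/a", "/c", "/a", "/b"]]
most_requested = top_urls(log, n=2)
```

In a real MapReduce job the sort is distributed across machines, but the reducer sees the same thing as here: all values for one key, delivered together.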
  • 14. Stream Processing
      • Load balancing (each message goes to one consumer) and fan-out (each message goes to all consumers)
      • Acknowledgments and redelivery
      • Partition (partitions can be on different machines) and topic (messages of the same type)
      Uses of stream processing:
      • Complex event processing (CEP): describe the patterns of events that should be detected.
      • Stream analytics: aggregations and statistical metrics over events (rates, rolling averages).
      • Search on streams: search for individual events (mentions of companies, products, topics).
  • 15. Keeping Systems in Sync & Stream Joins
      Change Data Capture (CDC)
      Dual writes issue
      Log compaction: the storage engine periodically looks for log records with the same key, throws away any duplicates, and keeps only the most recent update for each key.
      Stream joins:
      • Stream-stream joins (window join): match click events to the corresponding on-site search events by session ID in order to compute the click-through rate (CTR).
      • Stream-table joins (stream enrichment): join user activity stream events with user profiles in a database. A copy of the database can be loaded into the stream processor.
      • Table-table joins (materialized view maintenance). Example: the Twitter timeline cache.
        • When user u sends a new tweet, it is added to the timeline of every user who is following u.
        • When a user deletes a tweet, it is removed from all users’ timelines.
        • When user u1 starts following user u2, recent tweets by u2 are added to u1’s timeline.
        • When user u1 unfollows user u2, tweets by u2 are removed from u1’s timeline.
      Microbatching: break the stream into small blocks and treat each block like a miniature batch process. The batch size is typically around one second.
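Log compaction as described above reduces to "keep the last record per key". The sketch below assumes the log is an in-memory list of (key, value) pairs in write order; the function name and the sample records are illustrative.

```python
def compact(log):
    """Keep only the most recent record for each key, preserving log order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # a later record for the same key replaces the earlier one
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


# Hypothetical change log: keys "a" and "b" are each updated twice.
change_log = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
compacted = compact(change_log)
```

A consumer that replays the compacted log ends up with the same final state as one that replays the full log, which is what makes it usable for rebuilding a derived system.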
  • 17. Apache Spark: A Unified Engine for Large-scale Distributed Data Processing
      The central thrust of the Spark project was to bring in ideas borrowed from Hadoop MapReduce, but to enhance the system: make it highly fault tolerant and parallel, support in-memory storage for intermediate results between iterative and interactive map and reduce computations, offer easy and composable APIs in multiple languages as a programming model, and support other workloads in a unified manner.
      Speed
      • Takes advantage of efficient multithreading and parallel processing
      • Directed acyclic graph (DAG): an efficient computational graph whose tasks run across workers on the cluster
      • Generates compact code for execution
      Ease of use
      • Resilient Distributed Dataset (RDD): a fundamental abstraction of a simple logical data structure, upon which DataFrames and Datasets are constructed
      Modularity
      • Components: Spark SQL, Spark Structured Streaming, Spark MLlib, GraphX
      • APIs: Scala, SQL, Python, Java, R
      Extensibility
      • Decouples storage and compute
      • Spark’s DataFrameReaders and DataFrameWriters can also be extended to read data from other sources, such as Apache Kafka, Kinesis, Azure Storage, and Amazon S3, into its logical data abstraction, on which it can operate.
      Distributed data and partitions
      • A distributed scheme of breaking up data into chunks or partitions allows Spark executors to process only data that is close to them, minimizing network bandwidth. That is, each executor’s core is assigned its own data partition to work on.
      Apache Spark’s distributed execution
      • Spark driver: requests resources (CPU, memory, etc.) from the cluster manager for Spark’s executors (JVMs); transforms all the Spark operations into DAG computations, schedules them, and distributes their execution as tasks across the Spark executors.
  • 18. Spark Job/Stage/Task, Transformation/Action
      The Spark driver converts a Spark application into one or more Spark jobs. It then transforms each job into a DAG, Spark’s execution plan. As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel. Each task maps to a single core and works on a single partition of data.
      Transformations, actions, and lazy evaluation
      • All transformations are evaluated lazily.
      • Lazy evaluation lets Spark optimize transformations into stages for more efficient execution.
      • Nothing in a query plan is executed until an action is invoked.
      Narrow and wide transformations
      • Narrow: a single output partition can be computed from a single input partition. Examples: filter(), contains()
      • Wide: data from other partitions is read in, combined, and written to disk. Examples: groupBy(), orderBy()
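Lazy evaluation can be illustrated without Spark by using a Python generator as a stand-in for a transformation. This is an analogy only, not the Spark API: the lazy_filter function and the executed list are hypothetical helpers that record when work actually happens.

```python
executed = []  # records each row at the moment it is actually processed


def lazy_filter(predicate, rows):
    """Stand-in for a transformation: describes the work but does none until iterated."""
    for row in rows:
        executed.append(row)
        if predicate(row):
            yield row


# "Transformation": builds the plan, processes nothing.
plan = lazy_filter(lambda n: n % 2 == 0, range(6))
processed_before_action = list(executed)

# "Action": forces the plan to run.
result = list(plan)
processed_after_action = list(executed)
```

Deferring the work until the action is what gives an engine the chance to look at the whole chain of transformations and rearrange it before running anything.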
  • 19. Sample Code: Counting M&Ms for the Cookie Monster
      A Spark program reads a file with over 100,000 entries (where each row or line has a <state, mnm_color, count>) and computes and aggregates the counts for each color and state. The Spark program distributes the work across partitions and returns the final aggregated counts.
  • 20. Apache Spark’s Structured APIs
      The RDD is the most basic abstraction in Spark:
      • Dependencies: instruct Spark how an RDD is constructed from its inputs
      • Partitions (with some locality information): give Spark the ability to parallelize computation on partitions across executors
      • Compute function, Partition => Iterator[T]: produces an Iterator[T] for the data that will be stored in the RDD
      Basic Python data types in Spark
      Python structured data types in Spark
      Schemas and creating DataFrames: a schema defines the column names and associated data types for a DataFrame.
      schema = "author STRING, title STRING, pages INT"
      blogs_df = spark.createDataFrame(data, schema)
  • 21. Spark SQL and the Underlying Engine
      • Unifies Spark components and permits abstraction to DataFrames/Datasets in Java, Scala, Python, and R, which simplifies working with structured data sets.
      • Connects to the Apache Hive metastore and tables.
      • Reads and writes structured data with a specific schema from structured file formats (JSON, CSV, Text, Avro, Parquet, ORC, etc.) and converts data into temporary tables.
      • Offers an interactive Spark SQL shell for quick data exploration.
      • Provides a bridge to (and from) external tools via standard database JDBC/ODBC connectors.
      • Generates optimized query plans and compact code for the JVM, for final execution.
  • 22. Spark SQL and DataFrames
      Create SQL database
      Create table
      Create view
      • Views disappear after the Spark application terminates
      • Global: visible across all SparkSessions on a given cluster
      • Session-scoped: visible only to a single SparkSession
      Viewing the metadata
      Reading tables into DataFrames
      Reading from a database using JDBC
  • 24. Higher-Order Functions in Spark SQL and DataFrames
      Higher-order functions: transform(), filter(), exists()
      Union: combine two different DataFrames that have the same schema
      Join
      Pivot
  • 25. Structured Streaming
      Data stream as an unbounded table
      • Append mode: only append new rows
      • Update mode: update rows changed since the last trigger
      • Complete mode: write the entire result table
      Structured Streaming query:
      • Define input source
      • Transform data
      • Define output sink and mode
      • Specify processing details
      Triggering details:
      • Default: the next micro-batch is triggered as soon as the previous one has completed
      • Processing time with trigger interval: micro-batches are triggered at a fixed interval
      • Once: runs a single micro-batch and then stops, so an external scheduler can restart the query on a custom schedule
      • Continuous: processes data continuously instead of in micro-batches (experimental)
  • 26. Streaming Data Sources and Sinks
      Files
      • Each file must appear in the directory listing atomically; that is, the whole file must be available at once for reading, and once it is available, the file cannot be updated or modified.
      • Structured Streaming will process the file when the engine finds it (using directory listing) and internally mark it as processed. Any changes to that file will not be processed.
      Apache Kafka
      • Fields of the returned DataFrame: key, value, topic, partition, offset, timestamp, timestampType
  • 27. Stateful Streaming Aggregations, Stream-Stream Join
      Aggregations with event-time windows
      • The watermark defines how long the engine will wait for late data to arrive.
      • With these time constraints for each event, the processing engine can automatically calculate how long events need to be buffered to generate correct results, and when the events can be dropped from the state.
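The effect of a watermark on event-time windows can be sketched in plain Python. This is a deliberate simplification and not Spark's implementation: here the watermark is recomputed on every event as the maximum event time seen so far minus an allowed delay, and the window size, delay, and event times are made-up values.

```python
from collections import defaultdict


def windowed_counts(event_times, window=10, delay=5):
    """Count events per tumbling event-time window, dropping events that arrive too late.

    event_times: event timestamps in seconds, listed in arrival order.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for t in event_times:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay
        window_start = t - t % window
        if window_start + window <= watermark:
            # The window this event belongs to ended before the watermark:
            # its state may already be gone, so the event is discarded.
            dropped.append(t)
        else:
            counts[window_start] += 1
    return dict(counts), dropped


# The event at time 8 arrives late but within the delay; the event at time 9 arrives too late.
counts, dropped = windowed_counts([1, 4, 12, 8, 27, 9])
```

The delay is a trade-off: a larger value admits more late events but forces the engine to keep window state in memory for longer.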
  • 29. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
      Joblib: can be used for hyperparameter tuning, as it automatically broadcasts a copy of your data to all your workers, which then create their own models with different hyperparameters on their copies of the data.
      Koalas: Pandas is a very popular data analysis and manipulation library in Python, but it is limited to running on a single machine. Koalas is an open-source library that implements the Pandas DataFrame API on top of Apache Spark, easing the transition from Pandas to Spark.
      AWS SageMaker
      Azure ML
  • 30. Further Reading for Software Architecture