Towards Data Operations
Dr. Andrea Monacchi
Streaming data, models and code as first-class citizens
1. About Myself
2. Big Data Architectures
3. Stream-based computing
4. Integrating the Data Science workflow
Summary
About
● BS Computer Science (2010)
● MS Computer Science (2012)
● PhD Information Technology (2016)
● Consultancy experience @ Reply DE (2016-2018)
● Independent Consultant
Big Data Architectures
Big Data: why are we here?
Loads:
● Transactional (OLTP)
○ mix of read/write operations (inserts, updates, deletes, reads)
○ ACID properties: atomicity, consistency, isolation,
and durability
● Analytical (OLAP)
○ append-heavy loads
○ aggregations and explorative queries (analytics)
○ hierarchical indexing (OLAP hyper-cubes)
Scalability:
● ACID properties costly
● CAP Theorem
○ impossible for a distributed data store to simultaneously guarantee all 3 properties: consistency, availability, and partition tolerance (tolerance to communication errors)
○ CA are classic RDBMS - vertical scaling only
○ CP (e.g. quorum-based) and AP are NoSQL DBs
○ e.g. Amazon DynamoDB (eventual consistency, AP)
● NoSQL databases
○ relax ACID properties to achieve horizontal scaling
Scalability
● Key/Value Storage (DHT)
○ decentralised, scalable, fault-tolerant
○ easy to partition and distribute data by key
■ e.g. p = hash(k) % num_partitions
○ replication (partition redundancy)
■ 2: error detection
■ 3: error recovery
● Parallel collections
○ e.g. Spark RDD based on underlying HDFS blocks
○ master-slave (or driver-worker) coordination
○ no shared variables (accumulators, broadcast vars)
Parallel computation:
1. Split dataset into multiple partitions/shards
2. Independently process each partition
3. Combine partitions into result
● MapReduce (Functional Programming)
● Split-apply-combine
● Google’s MapReduce
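As an illustration of the split-apply-combine pattern described above, here is a minimal, framework-free Python sketch of a MapReduce-style word count: the dataset is split into partitions, each partition is processed independently, and the partial results are combined by key. All names and data are made up for illustration.

from collections import Counter
from functools import reduce

def split(records, num_partitions):
    # 1. split the dataset into multiple partitions/shards
    return [records[i::num_partitions] for i in range(num_partitions)]

def map_partition(partition):
    # 2. independently process each partition (local word count)
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts

def combine(partials):
    # 3. combine partial results into the final result (reduce by key)
    return reduce(lambda a, b: a + b, partials, Counter())

lines = ["a b a", "b c", "a c c"]
result = combine(map_partition(p) for p in split(lines, num_partitions=2))
print(result)  # word counts: a -> 3, b -> 2, c -> 3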
● Cluster Manager
○ Yarn, Mesos, K8s
○ resource isolation and scheduling
○ security, fault-tolerance, monitoring
● Data Serialization formats
○ Text: CSV, JSON, XML,..
○ Binary: SeqFile, Avro, Parquet, ORC, ..
● Batch Processing
○ Hadoop MapReduce variants
■ Pig, Hive
○ Apache Spark
○ Python Dask
● Workflow management tools
○ Fault-tolerant task coordination
○ Oozie, Airflow, Argo (K8s)
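As a hedged sketch of fault-tolerant task coordination with a workflow manager, the following hypothetical Airflow 2.x DAG chains an ingestion and a processing task with retries; the DAG id, task ids and shell commands are placeholders, not part of the original material.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# hypothetical daily pipeline: ids and commands are placeholders
with DAG(
    dag_id="daily_ingest_and_process",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo ingesting raw data")
    process = BashOperator(task_id="process", bash_command="echo processing ingested data")

    ingest >> process  # process runs only after ingest has succeeded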
Architectures for Data Analytics
● Stages
○ Ingestion (with retention)
○ (re)-processing
○ Presentation/Serving (indexed data, OLAP)
● Lambda Vs. Kappa architecture
○ batch for correctness, streaming for speed
○ mix of technologies (CAP theorem)
○ complexity & operational costs
Stream-based computing
1st phase - Ingestion: MQTT
● Pub/Sub
○ Totally uncoupled clients
○ messages can be explicitly retained (flag)
● QoS
○ multilevel (0: at most once, 1: at least once, 2: exactly once)
● Persistent sessions
○ broker keeps further info of clients to speed up reconnections
○ queuing of messages in QoS 1 and 2 (disconnections)
○ queuing of all acks for QoS 2
● Last Will and Testament (LWT) messages
○ stored by the broker and published on the client's behalf if it disconnects ungracefully; discarded when the client sends a proper disconnect
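A minimal sketch of these MQTT features with the Eclipse Paho Python client (paho-mqtt 1.x style API); the broker address, client id and topic names are placeholders. It opens a persistent session, registers a Last Will message, and publishes a retained reading at QoS 1.

import paho.mqtt.client as mqtt

# persistent session: the broker keeps subscriptions and queued QoS 1/2 messages
client = mqtt.Client(client_id="sensor-42", clean_session=False)

# Last Will and Testament: published by the broker only if this client disconnects ungracefully
client.will_set("sensors/42/status", payload="offline", qos=1, retain=True)

client.connect("broker.example.org", 1883)
client.loop_start()

# retained message at QoS 1 (at least once): new subscribers immediately receive the last value
client.publish("sensors/42/temperature", payload="21.5", qos=1, retain=True)

client.loop_stop()
client.disconnect()  # clean disconnect: the broker discards the will message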
1st phase - Ingestion: Kafka
● fault-tolerant pub/sub system with message retention
● exactly-once semantics (since v0.11.0) using the transactional.id producer setting (together with acks=all)
● topic as an append-only file log
○ ordering by arrival time (offset)
○ stores changes on the data source (deltas)
○ topic/log consists of multiple partitions
■ partitioning instructed by producer
■ guaranteed message ordering within partition
■ based on message key
● hash(k) % num_partitions
● if k == null, then round robin is used
■ distribution and replication by partition
● 1 elected active (leader) and n replicas
● ZooKeeper for cluster coordination (controller election, topic metadata)
[Diagram: a topic with six partitions (Partition0-Partition5) spread across three brokers (Broker1, Broker2, Broker3); a producer P writes to the partitions]
1st phase - Ingestion: Kafka
● topic maps to local directory
○ 1 file created per each topic partition
○ log rolling: when a size or time limit is reached, the partition file is rolled and a new one is created
■ log.roll.ms or log.roll.hours = 168
○ log retention:
■ save rolled log to segment files
■ older segments deleted after log.retention.hours or once log.retention.bytes is exceeded
○ log compaction:
■ deletion of older records by key (latest value)
■ log.cleanup.policy=compact on topic
● number of topic partitions (howto)
○ more data -> more partitions
○ proportional to desired throughput
○ overhead for open file handles and TCP connections
■ n partitions with m replicas each mean n*m partition logs to maintain
■ rule of thumb: total partitions on a broker < 100 * num_brokers * replication_factor, a budget shared across all topics (see the topic-creation sketch below)
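A hedged sketch of creating such a topic programmatically with the kafka-python admin client; the broker address, topic name, partition count and config values are illustrative assumptions only.

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# hypothetical topic sized for the desired throughput: 6 partitions, 3 replicas each
topic = NewTopic(
    name="clickstream",
    num_partitions=6,
    replication_factor=3,
    topic_configs={
        # time-based retention: segments older than ~168 hours become eligible for deletion
        "retention.ms": str(168 * 3600 * 1000),
        # for a changelog-style topic one would instead set "cleanup.policy": "compact"
    },
)
admin.create_topics(new_topics=[topic])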
1st phase - Ingestion: Kafka
● APIs
○ consumer / producer (single thread)
○ Kafka connect
○ KSQL
● Producer (Stateless wrt Broker)
○ ProducerRecord: (topic : String, partition : int,
timestamp : long, key : K, value : V)
○ partition id and time are optional (for manual setup)
● Kafka Connect
○ API for connectors (Source Vs Sink)
○ automatic offset management (commit and restore)
○ at least once delivery
○ exactly once only for certain connectors
○ standalone Vs. distributed modes
○ connectors configurable via REST interface
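A minimal producer sketch for the API above, using the kafka-python client; the broker address, topic and keys are placeholders. Records sharing a key are hashed to the same partition, preserving per-key ordering.

import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for the in-sync replicas before considering a send successful
)

# same key -> same partition (hash(k) % num_partitions), so per-user ordering is preserved
producer.send("clickstream", key="user-42", value={"page": "/home"})
producer.send("clickstream", key="user-42", value={"page": "/cart"})
producer.flush()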
1st phase - Ingestion: Kafka
● Consumer
○ stateful (maintains its own offset per topic partition)
○ earliest Vs. latest recovery
○ offset committing: manual or periodic (default every 5 secs)
○ consumers can be organized into load-balanced groups (partitions assigned across group members)
○ ideally: number of consumer threads = number of topic partitions
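A matching consumer sketch with kafka-python: it joins a consumer group, recovers from the earliest offset when no committed offset exists, and commits offsets manually after processing. Names and addresses are placeholders.

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="analytics",          # partitions are load-balanced across the group members
    auto_offset_reset="earliest",  # earliest vs. latest recovery when no offset is committed
    enable_auto_commit=False,      # commit manually instead of the periodic default
    value_deserializer=lambda v: v.decode("utf-8"),
)

for record in consumer:
    print(record.partition, record.offset, record.key, record.value)
    consumer.commit()  # manual offset commit once the record has been processed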
2nd phase - Stream processing
● streams
○ bounded - can be ingested and batch processed
○ unbounded - processed per event
● stream partitions
○ unit of parallelism
● stream and table duality
○ log changes <-> table
○ Declarative SQL APIs (e.g. KSQL, TableAPI)
● time
○ event time, ingestion time, processing time
○ late message handling (watermarks, buffering)
● windowing
○ bounded aggregations
○ time or data driven (e.g. count) windows
○ tumbling, sliding and session windows
● stateful operations/transformations
○ state: intermediate result, in-memory key-value store
used across windows or microbatches
○ e.g. RocksDB (Flink+KafkaStreams), LSM-tree
○ e.g. changelogging state back to a Kafka topic (Kafka Streams)
● checkpointing
○ save app status for failure recovery (stream replay)
● frameworks
○ Spark Streaming (micro-batching), Kafka Streams and Flink
○ Apache Storm, Apache Samza
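To tie the windowing and late-event concepts together, here is a hedged sketch using Spark Structured Streaming (the successor of the DStream-based Spark Streaming API); it assumes the Kafka connector package is available, and the topic name and broker address are placeholders. It counts events per user over tumbling 1-minute windows, tolerating events up to 30 seconds late via a watermark.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("windowed-counts").getOrCreate()

# unbounded stream of events read from a Kafka topic (placeholder names)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
    .selectExpr("CAST(key AS STRING) AS user", "timestamp")
)

# stateful windowed aggregation: tumbling 1-minute windows keyed by user,
# with a 30-second watermark to handle late-arriving events
counts = (
    events.withWatermark("timestamp", "30 seconds")
    .groupBy(window(col("timestamp"), "1 minute"), col("user"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()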
Code Examples
Flink
● processing web server logs for fraud detection
● count downloaded assets per user
● https://github.com/pilillo/flink-quickstart
● Deploy to Kubernetes
SparkStreaming & Kafka
● https://github.com/pilillo/sparkstreaming-quickstart
Kafka Streams
● project skeleton
Integrating the Data Science workflow
Data Science workflow
Technical gaps potentially resulting from this process!
● Data Analytics projects
○ stream of research questions and feedback
● Data forked for exploration
○ Data versioning
○ Periodic data quality assessment
● Team misalignment
○ scientists working separately from the dev team
○ information gaps w/ team & stakeholders
○ unexpected behaviors upon changes
● Results may not be reproducible
● Releases are not frequent
○ Value misalignment (waste of resources)
● CICD only used for data preparation
Data Operations (DataOps)
● DevOps approaches
○ lean product development (continuous feedback and value delivery) using CICD approaches
○ cross-functional teams with mix of development & operations skillset
● DataOps
○ devops for streaming data and analytics as a manufacturing process - Manifesto, Cookbook
○ mix of data engineering and data science skill set
○ focus: continuous data quality assessment, model reproducibility, incremental/progressive delivery of value
Data Operations workflow
Data Science workflow
CICD activities
CICD for ML
● Continuous Data Quality Assessment
○ data versioning (e.g. Pachyderm)
○ syntax (e.g. Confluent Schema Registry)
○ semantic (e.g. Apache Griffin) - accuracy, completeness, timeliness, uniqueness, validity, consistency
● Model Tuning
○ hyperparameter tuning - black-box optimization
○ autoML - model selection
○ continuous performance evaluation (wrt newer input data)
○ stakeholder/user performance (e.g. AB testing)
● Model Deployment
○ TensorFlow Serving, Seldon
● ML workflow management
○ Amazon SageMaker, Google Cloud ML Engine, MLflow, Kubeflow, Polyaxon
CICD for ML code
See also: https://github.com/EthicalML/awesome-machine-learning-operations
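As a hedged illustration of continuous model evaluation and reproducibility, the following sketch tracks a small hyperparameter sweep with MLflow; the dataset, model and parameter grid are stand-ins chosen only for the example.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# one tracked run per hyperparameter setting, so results stay reproducible and comparable
for C in (0.01, 0.1, 1.0):
    with mlflow.start_run():
        model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
        mlflow.log_param("C", C)
        mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
        mlflow.sklearn.log_model(model, "model")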
Data-Mill project
● Based on Kubernetes
○ open and scalable
○ seamless integration of bare-metal and cloud-provided clusters
● Enforcing DataOps principles
○ continuous asset monitoring (code, data, models)
○ open-source tools to reproduce and serve models
● Flavour-based organization of components
○ flavour = cluster_spec + SW_components
● Built-in exploration environments (dashboarding tools, jupyter notebooks with DS libraries)
https://data-mill-cloud.github.io/data-mill/
Thank you!
