This module has been created to answer all the questions on how IPFS can be used for dynamic real-time applications. In this module, you will learn:
- how to reason about dynamic data on IPFS,
- IPNS, the simplest construction for naming in IPFS,
- how PubSub can offer subsecond speeds for interactive applications,
- how CRDTs are a fundamental building block for distributed applications,
- what is available in the ecosystem.
The content routing system of IPFS is the part of the architecture that discovers content in the network. It is considered by many as the most important part of the architecture, as well as the one with the most open research questions. Through this module, you will learn:
- IPFS's Content Routing Architecture,
- the protocol settings and algorithmics of IPFS’s mighty DHT,
- IPFS's gossip-based content routing approaches.
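The IPFS DHT is based on Kademlia, which ranks peers by the XOR distance between identifiers. The sketch below illustrates only that distance metric; it is not the IPFS implementation. Hashing names with SHA-256, the peer names, and the value of `k` are assumptions made for the example.

```python
import hashlib
from typing import List


def key_id(name: bytes) -> int:
    """Map a name to a 256-bit identifier (assumption: plain SHA-256)."""
    return int.from_bytes(hashlib.sha256(name).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance: bitwise XOR read as an integer."""
    return a ^ b


def closest_peers(target: bytes, peers: List[bytes], k: int = 3) -> List[bytes]:
    """Return the k peers whose identifiers are XOR-closest to the target."""
    target_id = key_id(target)
    return sorted(peers, key=lambda p: xor_distance(key_id(p), target_id))[:k]


# Hypothetical peer names and content key, for illustration only.
peers = [b"peer-%d" % i for i in range(10)]
nearest = closest_peers(b"some-content-key", peers, k=3)
print(nearest)
```

In a Kademlia-style lookup, a node repeatedly asks the closest peers it knows for peers that are even closer to the target, until no closer ones are found.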
In this module, you’ll learn:
- IPFS’s Content Exchange protocols, namely, Bitswap and GraphSync,
- the message types and workflows of Bitswap and GraphSync, and
- the differences between the two protocols.
Learn all about the most foundational principle of the IPFS architecture: the IPFS Content Identifier, or CID. Through this module, you will understand:
- how IPFS addresses files,
- how IPFS transforms files into Merkle DAGs, as well as
- the detailed anatomy of a CID!
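As a rough illustration of the anatomy of a CID, the sketch below assembles a CIDv1 by hand for a single block of raw bytes: version, content codec, multihash (hash code, digest length, digest), and a multibase prefix. It is a simplification under stated assumptions: it treats the whole input as one raw block, whereas adding a real file to IPFS typically chunks it into a Merkle DAG, so the CID reported by an IPFS node for the same file can differ. The function name is hypothetical.

```python
import base64
import hashlib

CID_VERSION_1 = 0x01   # CID version
RAW_CODEC = 0x55       # multicodec code for raw bytes
SHA2_256 = 0x12        # multihash code for sha2-256
DIGEST_LENGTH = 0x20   # 32-byte digest


def raw_cid_v1(data: bytes) -> str:
    """Build a base32 CIDv1 string for one raw block (illustrative sketch)."""
    digest = hashlib.sha256(data).digest()
    # Multihash = <hash function code><digest length><digest>
    multihash = bytes([SHA2_256, DIGEST_LENGTH]) + digest
    # Binary CID = <version><content codec><multihash>
    cid_bytes = bytes([CID_VERSION_1, RAW_CODEC]) + multihash
    # Multibase: "b" marks lowercase base32 without padding.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


cid = raw_cid_v1(b"hello IPFS")
print(cid)
```

The same bytes always produce the same identifier, and any change to the content changes it, which is what makes the address a content address.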
Web 3.0 joins Blockchains, the Semantic Web, and the Distributed Web in one package and creates a revolution, which will change the way we do networking! In this module, you will learn:
- the issues facing Web2.0 and the motivation for the Web3.0 movement,
- IPFS and its role in Web3.0,
- the progress so far in the IPFS ecosystem.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1RJcfss.
Juan Batiz-Benet makes a short intro of IPFS (the InterPlanetary File System), a new hypermedia distribution protocol, addressed by content and identities. He also discusses the IPLD data model and example data structures (unixfs, keychain, post). Filmed at qconsf.com.
Juan Batiz-Benet is an Independent Scientist.
ArcBlock Technical Learning Series Presents IPFS.
If there is a missing piece in the current blockchain stack, it is a decentralized, publicly verifiable file system. Ideally, before decentralizing computing, we should decentralize the data. IPFS fills this gap, and it has great potential to push the web toward the true Web3: the decentralized web. This talk covers what problem IPFS is trying to solve, how it solves the problem, and how to use IPFS in our applications.
https://www.arcblock.io
https://hack.arcblock.io/learning
Apache Bigtop: a crash course in deploying a Hadoop bigdata management platform - rhatr
A long time ago in a galaxy far, far away only the chosen few could deploy and operate a fully functional Hadoop cluster. Vendors were taking pride in rationalizing this experience to their customers by creating various distributions including Apache Hadoop. It all changed when Cloudera decided to support Apache Bigtop as the first 100% community-driven bigdata management distribution of Apache Hadoop. Today, most major commercial distributions of Apache Hadoop are based on Bigtop. Bigtop has won the Hadoop distributions war and is offering a superset of packaged components. In this talk we will focus on practical advice on how to deploy and start operating a Hadoop cluster using Bigtop’s packages and deployment code. We will dive into the details of using the Hadoop ecosystem packages provided by Bigtop and how to build data management pipelines in support of your enterprise applications.
This talk is from Distributed Data Summit SF 2018 - http://distributeddatasummit.com/2018-sf/sessions#chella
Audit logging is one of the most critical features in an enterprise-ready database in terms of security compliance. Furthermore, live traffic troubleshooting is critical for operators to resolve production issues quickly. While past versions lacked these features, the Cassandra team understood the need for better solutions, and in the upcoming release of Cassandra both features come out of the box, which makes Cassandra even more awesome to work with. Cassandra now supports audit logging and query logging as part of C* itself. In this talk, the audience will learn how to enable, configure, and tune audit logging for their C* clusters and how to log live traffic/queries for several needs, including troubleshooting or even live traffic replay.
Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training |... - Edureka!
This Edureka Spark Hadoop Tutorial will help you understand how to use Spark and Hadoop together. This Spark Hadoop tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Spark Overview
2) Hadoop Overview
3) Spark vs Hadoop
4) Why Spark Hadoop?
5) Using Hadoop With Spark
6) Use Case - Sports Analytics (NBA)
IPFS is a distribution protocol that enables the creation of completely distributed applications through content addressing. A very ambitious open source project in Go, IPFS adopts a peer-to-peer hypermedia protocol to protect against a single point of failure. This presentation aims to highlight the design and ideas of IPFS and also touches upon a real world use case.
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach - DATAVERSITY
After three decades of relational data modeling, everyone’s pretty comfortable with schemas, tables, and entity-relationships. As more and more Global 2000 companies choose NoSQL databases to power their Digital Economy applications, they need to think about how to best model their data. How do they move from a constrained, table-driven model to an agile, flexible data model based on JSON documents?
This webinar is intended for architects and application developers who want to learn about new JSON document data modeling approaches, techniques, and best practices. This webinar will show you how to get started building a JSON document data model, how to migrate a table-based data model to JSON documents, and how to optimize your design to enable fast query performance.
This webinar will provide practical, experience-based advice and best practices for modeling JSON documents, including:
- When to embed or not embed objects in your JSON document
- Data modeling using a practical data access pattern approach
- Indexing your JSON documents
- Querying your data using N1QL (SQL for JSON)
Working With a Real-World Dataset in Neo4j: Import and Modeling - Neo4j
This webinar will cover how to work with a real-world dataset in Neo4j, with a focus on how to build a graph from an existing dataset (in this case a series of JSON files). We will explore how to import the data into Neo4j efficiently, both for an initial import and when scaling writes for your graph application. We will demonstrate different approaches for data import (neo4j-import, LOAD CSV, and using the official Neo4j drivers), and discuss when it makes sense to use each import technique. If you've ever asked these questions, then this webinar is for you!
- How do I design a property graph model for my domain?
- How do I use the official Neo4j drivers?
- How can I deal with concurrent writes to Neo4j?
- How can I import JSON into Neo4j?
The Raft protocol has been successfully used for consistent metadata replication; however, using it for data replication poses unique challenges. Apache Ratis is a Raft implementation targeted at high-throughput data replication problems. Apache Ratis is being successfully used as a consensus protocol for data stored in Ozone (object store) and Quadra (block device) to provide data throughput that saturates the network links and disk bandwidths.
The pluggable nature of Ratis renders it useful for multiple use cases including high availability, data or metadata replication, and ensuring consistency semantics.
This talk presents the design challenges to achieve high throughput and how Apache Ratis addresses them. We talk about specific optimizations that have been implemented to minimize overheads and scale up the throughput while maintaining correctness of the consistency protocol. The talk also explains how systems like Ozone take advantage of Ratis’s implementation choices to achieve scale. We will discuss the current performance numbers and also future optimizations.
MUKUL KUMAR SINGH, Staff Software Engineer, Hortonworks and LOKESH JAIN, Software Engineer, Hortonworks
In this session, you'll learn how RBD works, including how it:
- Uses RADOS classes to make access easier from user space and within the Linux kernel.
- Implements thin provisioning.
- Builds on RADOS self-managed snapshots for cloning and differential backups.
- Increases performance with caching of various kinds.
- Uses watch/notify RADOS primitives to handle online management operations.
- Integrates with QEMU, libvirt, and OpenStack.
In this presentation, I take a deep dive into the InfluxDB open source storage engine. More than just a single storage engine, InfluxDB is two engines in one: the first for time series data and the second, an index for metadata. I'll delve into the optimizations for achieving high write throughput, compression and fast reads for both the raw time series data and the metadata.
Module: the modular p2p networking stack - Ioannis Psaras
libp2p is the ultimate Web3.0 library of choice for decentralised process addressing. In this module, you will hear about all of libp2p’s modular and composable building blocks for P2P networking, which include:
- Transport protocols, pubsub protocols and multiplexers
- Secure channels and NAT Traversal
- Peer discovery, content- and peer-routing
KubeCon EU 2016: Kubernetes Storage 101 - KubeAcademy
You have deployed your application on Kube and now you want to actually do something permanent with it?? You will need STORAGE.
This talk will be a good introduction to using storage in Kubernetes. It will cover the use of EmptyDir, HostPath and Persistent Storage options, and how to configure and use each type. This talk will also discuss the security features for storage in the open source OpenShift project.
Sched Link: http://sched.co/6BcS
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains as well as integration with other big data technologies such as Apache Spark, Druid, and Kafka. The talk will also provide a glimpse of what is expected to come in the near future.
Stream Processing using Apache Flink in Zalando's World of Microservices - Re... - Zalando Technology
In this talk we present Zalando's microservices architecture, introduce Saiki – our next generation data integration and distribution platform on AWS and show how we employ stream processing for near-real time business intelligence.
Zalando is one of the largest online fashion retailers in Europe. In order to secure our future growth and remain competitive in this dynamic market, we are transitioning from a monolithic to a microservices architecture and from a hierarchical to an agile organization.
We first have a look at how business intelligence processes have been working inside Zalando for the last years and present our current approach - Saiki. It is a scalable, cloud-based data integration and distribution infrastructure that makes data from our many microservices readily available for analytical teams.
We no longer live in a world of static data sets, but are instead confronted with an endless stream of events that constantly inform us about relevant happenings from all over the enterprise. The processing of these event streams enables us to do near-real time business intelligence. In this context we have evaluated Apache Flink vs. Apache Spark in order to choose the right stream processing framework. Given our requirements, we decided to use Flink as part of our technology stack, alongside Kafka and Elasticsearch.
With these technologies we are currently working on two use cases: a near real-time business process monitoring solution and streaming ETL.
Monitoring our business processes enables us to check whether the Zalando platform is technically working. It also helps us analyze data streams on the fly (e.g. order velocities and delivery velocities) and control service level agreements.
On the other hand, streaming ETL is used to free up resources in our relational data warehouse, which struggles with increasingly high loads. In addition, it reduces latency and facilitates platform scalability.
Finally, we have an outlook on our future use cases, e.g. near-real time sales and price monitoring. Another aspect to be addressed is to lower the entry barrier of stream processing for our colleagues coming from a relational database background.
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest - PGDay.Amsterdam
Ever heard of point-in-time recovery? pgBackRest is an awesome tool that handles backups and restores, and even helps you build streaming replication! This talk will introduce the tool, its basic features and how to use it.
Basic introduction to Cassandra, with its architecture and strategies.
Covers the big data challenge and what a NoSQL database is.
- The Big Data Challenge
- The Cassandra Solution
- The CAP Theorem
- The Architecture of Cassandra
- The Data Partition and Replication
Scaling your Data Pipelines with Apache Spark on Kubernetes - Databricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with scalable data processing on Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- Demonstration of running analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Big Data Streams Architectures. Why? What? How? - Anton Nazaruk
With the current zoo of technologies and the different ways they interact, it is a big challenge to architect a system (or adapt an existing one) that meets low-latency BigData analysis requirements. Apache Kafka and the Kappa Architecture in particular are drawing more and more attention compared with the classic Hadoop-centric technology stack. The new Consumer API gave a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams tend to form a synergy in the BigData world.
Everything you always wanted to know about Distributed databases, at devoxx l... - javier ramirez
Everything you always wanted to know about Distributed databases, at devoxx london, by javier ramirez, teowaki.
Basic concepts of distributed systems, such as consensus, gossip and infection protocols, vector clocks, sharding storage, so you can create highly available distributed systems
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre Zemb - StreamNative
OVHcloud is the biggest European cloud provider. From dedicated servers to Managed Kubernetes, from VMware® based Hosted Private Cloud to OpenStack-based Public Cloud, we have over 1.4 million customers worldwide.
Internally, we have been running Apache Kafka for years, and despite all the skills gained from operating multiple clusters with millions of messages per second, we decided to shift and build the foundation of our 'topic-as-a-service' product called ioStream on Apache Pulsar.
In this talk, you will learn why we decided to use Apache Pulsar instead of Apache Kafka as the core of ioStream. We will describe our journey with Apache Pulsar, from deployment to management, including what worked and what did not.
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersEdelweiss Kammermann
This session is based on a full-day big data workshop delivered to 40 database professionals at the German User Group (DOAG) conference in 2016, garnering fantastic feedback (www.munzandmore.com/2016/ora/big-data-cloudera-oracle-training-feedback-doag). There are zillions of open source big data projects these days. In the session, you will learn about the core principles of four key technologies that are most often used in projects: Hadoop, Spark, Hive, and Kafka. The presentation first explains the fundamentals of those four big data technologies. Then you will see how to take the first easy steps into the big data world yourself, with Oracle Big Data Cloud Service and Oracle Event Hub Cloud Service live demos.
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...HostedbyConfluent
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrumgaard | Current 2022
At home, I monitor the temperature, humidity, gas levels, ozone, air quality, and other features around my desk.
Let's bring this to the different spots around the conference including lunch tables, vendor booths, hotel rooms, and more. I need to know about these readings now, not when I get back home from the conference. We need to get these sensor readings immediately in case we need to turn on a fan or move to another area. We will also see if my talk produces a lot of hot air!?!??
My setup is pretty simple, a raspberry pi, a breakout garden sensor mount, and as many sensors as I am willing to fly to Austin. The software stack is Python and Java, Apache Pulsar, MQTT, HTML, JQuery, and Apache Kafka.
https://dzone.com/articles/five-sensors-real-time-with-pulsar-and-python-on-a
https://www.datainmotion.dev/2022/04/flip-py-pi-enviroplus-using-apache.html
https://dzone.com/articles/pulsar-in-python-on-pi
(Current22) Let's Monitor The Conditions at the ConferenceTimothy Spann
(Current22) Let's Monitor The Conditions at the Conference
Let's Monitor The Conditions at the Conference
Session Time11:15 am - 12:00 pm Session DateWednesday, 5 October 2022 Session Type:In-Person Location:Ballroom G
Session Description:
At home, I monitor the temperature, humidity, gas levels, ozone, air quality, and other features around my desk. Let's bring this to the different spots around the conference including lunch tables, vendor booths, hotel rooms, and more. I need to know about these readings now, not when I get back home from the conference. We need to get these sensor readings immediately in case we need to turn on a fan or move to another area. We will also see if my talk produces a lot of hot air!? My setup is pretty simple, a raspberry pi, a breakout garden sensor mount, and as many sensors as I am willing to fly to Austin. The software stack is Python and Java, Apache Pulsar, MQTT, HTML, JQuery, and Apache Kafka.
Timothy Spann, StreamNative
Developer Advocate
Tim Spann is a Developer Advocate @ StreamNative where he works with Apache Pulsar, Apache Flink, Apache NiFi, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC.
Hail hydrate! from stream to lake using open sourceTimothy Spann
(VIRTUAL) Hail Hydrate! From Stream to Lake Using Open Source - Timothy J Spann, StreamNative
https://osselc21.sched.com/event/lAPi?iframe=no
A cloud data lake that is empty is not useful to anyone. How can you quickly, scalably and reliably fill your cloud data lake with diverse sources of data you already have and new ones you never imagined you needed. Utilizing open source tools from Apache, the FLiP stack enables any data engineer, programmer or analyst to build reusable modules with low or no code. FLiP utilizes Apache NiFi, Apache Pulsar, Apache Flink and MiNiFi agents to load CDC, Logs, REST, XML, Images, PDFs, Documents, Text, semistructured data, unstructured data, structured data and a hundred data sources you could never dream of streaming before. I will teach you how to fish in the deep end of the lake and return a data engineering hero. Let's hope everyone is ready to go from 0 to Petabyte hero.
https://osselc21.sched.com/event/lAPi/virtual-hail-hydrate-from-stream-to-lake-using-open-source-timothy-j-spann-streamnative
Captial One: Why Stream Data as Part of Data Transformation?ScyllaDB
Event-driven architectures are increasingly part of a complete data transformation solution. Learn how to employ Apache Kafka, Cloud Native Computing Foundation’s NATS, Amazon SQS, or other message queueing technologies. This talks covers the details of each, their advantages and disadvantages and how to select the best for your company’s needs.
The world has changed and having one huge server won’t do the job anymore, when you’re talking about vast amounts of data, growing all the time the ability to Scale Out would be your savior. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
This lecture will be about the basics of Apache Spark and distributed computing and the development tools needed to have a functional environment.
Big data conference europe real-time streaming in any and all clouds, hybri...Timothy Spann
Biography
Tim Spann is a Principal DataFlow Field Engineer at Cloudera where he works with Apache NiFi, MiniFi, Pulsar, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.
Talk
Real-Time Streaming in Any and All Clouds, Hybrid and Beyond
Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at the scale and as events arrive.
Tools:
Apache Flink, Apache Pulsar, Apache NiFi, MiNiFi, DJL.ai Apache MXNet.
References:
https://www.datainmotion.dev/2019/11/introducing-mm-flank-apache-flink-stack.html
https://www.datainmotion.dev/2019/08/rapid-iot-development-with-cloudera.html
https://www.datainmotion.dev/2019/09/powering-edge-ai-for-sensor-reading.html
https://www.datainmotion.dev/2019/05/dataworks-summit-dc-2019-report.html
https://www.datainmotion.dev/2019/03/using-raspberry-pi-3b-with-apache-nifi.html
Source Code: https://github.com/tspannhw/MmFLaNK
FLiP Stack
StreamNative
Deploy Secure and Scalable Services Across Kubernetes Clusters with NATSNATS
Services and Streams are the cornerstones of any modern distributed architecture. Communications and observability of modern systems have become just as important as the deployment of the components themselves. In this talk maintainers of the NATS projectwill create a service using NATS as the communication technology. They will show how NATS allows a service application to utilize cutting edge security with the ability to scale up and down, across multiple Kubernetes clusters and cloud deployments. This will be completely observable, with no code changes from the demo code base to global deployment. NATS allows cutting edge modern systems to be built without the additional complexity of load balancers, proxies or sidecars. NATS allows radically easy yet secure deployments across multiple k8s clusters, in any cloud or on-premise environment.
HPC control systems are evolving into the future. This presentation looks at where this evolution may lead, and describes how the control system of the future might be constructed.
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpJosé Román Martín Gil
Apache Kafka is the most used data streaming broker by companies. It could manage millions of messages easily and it is the base of many architectures based in events, micro-services, orchestration, ... and now cloud environments. OpenShift is the most extended Platform as a Service (PaaS). It is based in Kubernetes and it helps the companies to deploy easily any kind of workload in a cloud environment. Thanks many of its features it is the base for many architectures based in stateless applications to build new Cloud Native Applications. Strimzi is an open source community that implements a set of Kubernetes Operators to help you to manage and deploy Apache Kafka brokers in OpenShift environments.
These slides will introduce you Strimzi as a new component on OpenShift to manage your Apache Kafka clusters.
Slides used at OpenShift Meetup Spain:
- https://www.meetup.com/es-ES/openshift_spain/events/261284764/
Hadoop Ecosystem and Low Latency Streaming ArchitectureInSemble
"Hadoop Ecosystem and Low Latency Streaming Architecture" was presented by Vijay Mandava and Lan Jiang to Detroit Java User Group on 3/23/2015. It covers the basic introduction of Hadoop Ecosystem and then focus on the low latency streaming architecture, including frameworks such as Flume, Kafka and Storm.
Gen Z and the marketplaces - let's translate their needsLaura Szabó
The product workshop focused on exploring the requirements of Generation Z in relation to marketplace dynamics. We delved into their specific needs, examined the specifics in their shopping preferences, and analyzed their preferred methods for accessing information and making purchases within a marketplace. Through the study of real-life cases , we tried to gain valuable insights into enhancing the marketplace experience for Generation Z.
The workshop was held on the DMA Conference in Vienna June 2024.
Understanding User Behavior with Google Analytics.pdfSEO Article Boost
Unlocking the full potential of Google Analytics is crucial for understanding and optimizing your website’s performance. This guide dives deep into the essential aspects of Google Analytics, from analyzing traffic sources to understanding user demographics and tracking user engagement.
Traffic Sources Analysis:
Discover where your website traffic originates. By examining the Acquisition section, you can identify whether visitors come from organic search, paid campaigns, direct visits, social media, or referral links. This knowledge helps in refining marketing strategies and optimizing resource allocation.
User Demographics Insights:
Gain a comprehensive view of your audience by exploring demographic data in the Audience section. Understand age, gender, and interests to tailor your marketing strategies effectively. Leverage this information to create personalized content and improve user engagement and conversion rates.
Tracking User Engagement:
Learn how to measure user interaction with your site through key metrics like bounce rate, average session duration, and pages per session. Enhance user experience by analyzing engagement metrics and implementing strategies to keep visitors engaged.
Conversion Rate Optimization:
Understand the importance of conversion rates and how to track them using Google Analytics. Set up Goals, analyze conversion funnels, segment your audience, and employ A/B testing to optimize your website for higher conversions. Utilize ecommerce tracking and multi-channel funnels for a detailed view of your sales performance and marketing channel contributions.
Custom Reports and Dashboards:
Create custom reports and dashboards to visualize and interpret data relevant to your business goals. Use advanced filters, segments, and visualization options to gain deeper insights. Incorporate custom dimensions and metrics for tailored data analysis. Integrate external data sources to enrich your analytics and make well-informed decisions.
This guide is designed to help you harness the power of Google Analytics for making data-driven decisions that enhance website performance and achieve your digital marketing objectives. Whether you are looking to improve SEO, refine your social media strategy, or boost conversion rates, understanding and utilizing Google Analytics is essential for your success.
Ready to Unlock the Power of Blockchain!Toptal Tech
Imagine a world where data flows freely, yet remains secure. A world where trust is built into the fabric of every transaction. This is the promise of blockchain, a revolutionary technology poised to reshape our digital landscape.
Toptal Tech is at the forefront of this innovation, connecting you with the brightest minds in blockchain development. Together, we can unlock the potential of this transformative technology, building a future of transparency, security, and endless possibilities.
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdfFlorence Consulting
Quattordicesimo Meetup di Milano, tenutosi a Milano il 23 Maggio 2024 dalle ore 17:00 alle ore 18:30 in presenza e da remoto.
Abbiamo parlato di come Axpo Italia S.p.A. ha ridotto il technical debt migrando le proprie APIs da Mule 3.9 a Mule 4.4 passando anche da on-premises a CloudHub 1.0.
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBrad Spiegel Macon GA
Brad Spiegel Macon GA’s journey exemplifies the profound impact that one individual can have on their community. Through his unwavering dedication to digital inclusion, he’s not only bridging the gap in Macon but also setting an example for others to follow.
2. IPFS Components
CONTENT ADDRESSING
● Chunking
● Linking Chunks in Merkle DAGs
● From Data to Data Structures with IPLD
● Anatomy of the IPFS CID
CONTENT DISCOVERY & ROUTING
● Routing & Provider Records
● DHT-based Routing
● Gossip-based Routing
CONTENT EXCHANGE
● Bitswap
● GraphSync
MUTABLE NAMES & MESSAGE DELIVERY
● Dynamic Data
● IPNS
● PubSub
● CRDTs
3. Agenda
➔ Motivation
➔ Messaging Layer
◆ IPFS/libp2p PubSub
◆ PubSub Routers (Floodsub & Gossipsub)
➔ Mutable Pointers
◆ Self-Certified Names (IPNS)
◆ IPNS over DHT, PubSub & DNS
➔ Type Systems for Distributed Applications
◆ Conflict Free Replicated Data Types (CRDT)
➔ Access Controls for Distributed Applications
◆ Cryptographic ACL / Capabilities System
➔ Examples of tools and libraries available in the ecosystem
7. ➔ IPFS gives us immutable identifiers (CIDs), but Rich Web Apps need dynamic
content that changes rapidly
➔ Ability to create, edit and update existing data, from text to images, videos and so on.
➔ Propagating updates to other participants, creating a shared view into the latest state
of a dataset
➔ Grant and revoke read and write permissions to data
➔ Support multiple interaction patterns: one-to-many (e.g. Blog), many-to-many (e.g.
Social Networks) and many-to-one (e.g. RSS/feed subscriptions)
What is Mutable Content about?
12. Benefits
➔ Layer for distributing updates
➔ Message Oriented Comms -> Simple to program
➔ Support for different topologies -> Different types
of interaction patterns supported
➔ Loose Coupling -> Separation of Concerns
➔ Scalable -> Adapt as your network grows
A Messaging Layer supported by
P2P PubSub is key for supporting
different interaction patterns
Challenges
➔ Permission-less network -> can’t control who
joins or leaves
➔ Network topology is bottom-up (e.g. brokerless)
➔ Network Churn
➔ Optimizations for Latency/Bandwidth/Delivery
Guarantees are about tradeoffs, not one-size-fits-all
13. ➔ Topic based interface
➔ Design Goals
◆ Reliability: All messages get delivered to all peers
subscribed to the topic.
◆ Speed: Messages are delivered quickly.
◆ Efficiency: The network is not flooded with excess
copies of messages.
◆ Resilience: Peers can join and leave the network
without disrupting it. There is no central point of
failure.
◆ Scale: Topics can have enormous numbers of
subscribers and handle a large throughput of
messages.
◆ Simplicity: The system is simple to understand and
implement. Each peer only needs to remember a
small amount of state.
IPFS/libp2p PubSub
➔ Parameterizable
➔ Two main implementations
◆ FloodSub
◆ Gossipsub
14. ➔ Design
◆ Simplest possible design
◆ Ambient Peer Discovery (e.g. IPFS Main Network’s DHT, MulticastDNS and
other systems)
◆ Routing is achieved by Flooding
➔ Pros
◆ Robust delivery guarantees, even in the presence of high
network churn
◆ Minimum latency delivery
➔ Cons
◆ Huge bandwidth overhead
(duplicate messages received)
◆ Unbounded degree flooding
(wasteful)
FloodSub
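The trade-off listed for FloodSub (every peer is reached quickly, at the cost of many duplicate copies) can be seen in a small simulation. This is a minimal in-memory sketch, not the libp2p implementation: `FloodNode`, `connect` and `publish` are hypothetical names, and the network is assumed to be five fully connected peers subscribed to a single topic.

```python
from collections import deque


class FloodNode:
    """A peer that forwards every new message to all of its neighbours."""

    def __init__(self, name):
        self.name = name
        self.peers = []       # directly connected peers
        self.seen = set()     # ids of messages already handled
        self.received = 0     # copies received, duplicates included


def connect(a, b):
    a.peers.append(b)
    b.peers.append(a)


def publish(origin, msg_id):
    """Flood msg_id from origin and return the number of transmissions."""
    origin.seen.add(msg_id)
    queue = deque((origin, peer) for peer in origin.peers)
    transmissions = 0
    while queue:
        sender, receiver = queue.popleft()
        transmissions += 1
        receiver.received += 1
        if msg_id in receiver.seen:
            continue  # duplicate copy: counted as overhead, not forwarded again
        receiver.seen.add(msg_id)
        for peer in receiver.peers:
            if peer is not sender:
                queue.append((receiver, peer))
    return transmissions


# Five fully connected peers, all subscribed to the same topic (assumed setup).
nodes = [FloodNode(f"n{i}") for i in range(5)]
for i, a in enumerate(nodes):
    for b in nodes[i + 1:]:
        connect(a, b)

total = publish(nodes[0], "m1")
delivered = sum(1 for n in nodes if "m1" in n.seen)
print(f"{delivered} peers reached with {total} transmissions")
```

In this setup four transmissions would be enough to reach the other four peers, but flooding sends 16, and the gap grows with the degree of each peer.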
16. ➔ Bleeding edge P2P PubSub
➔ Design
◆ Self Stabilizing algorithm
◆ 2 Networks: Metadata (control plane) & Message (data plane)
◆ Nodes establish local meshes with reciprocal agreements on subscriptions
◆ Degree bounded by default to 6
◆ Nodes send Gossip with
● GRAFT & PRUNE to establish peering agreements
● Gossip about what messages have been delivered and seen
◆ Eager push/Lazy pull model
◆ Scoring function to reward benevolent peers and penalize badly behaving peers
◆ Support for protocol extensions
➔ Pros
◆ Resilient to Churn
◆ Resilient to Sybil, Eclipse & Spam Attacks
◆ Minimizes Bandwidth usage
➔ Cons
◆ For small networks, it doesn’t achieve the same minimum latency guarantees as
Floodsub
Gossipsub with
hardening extensions
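The degree-bounded mesh with GRAFT and PRUNE can be sketched as a periodic maintenance step. This is a simplified illustration, not the Gossipsub implementation: `MeshNode` and `heartbeat` are hypothetical names, the target degree `D = 6` comes from the slide, and the `D_LOW` / `D_HIGH` thresholds are assumed values chosen only for the example. Gossip about seen messages, scoring and the lazy-pull path are left out.

```python
import random

D = 6        # target mesh degree (from the slide)
D_LOW = 4    # assumed lower bound that triggers grafting
D_HIGH = 12  # assumed upper bound that triggers pruning


class MeshNode:
    def __init__(self, name):
        self.name = name
        self.known = set()  # peers known to subscribe to the topic
        self.mesh = set()   # peers with a reciprocal mesh link

    def graft(self, other):
        # GRAFT: both sides add each other (reciprocal agreement).
        self.mesh.add(other)
        other.mesh.add(self)

    def prune(self, other):
        # PRUNE: both sides drop the link.
        self.mesh.discard(other)
        other.mesh.discard(self)

    def heartbeat(self, rng):
        """Keep the mesh degree within [D_LOW, D_HIGH], aiming for D."""
        if len(self.mesh) < D_LOW:
            candidates = sorted(self.known - self.mesh, key=lambda n: n.name)
            rng.shuffle(candidates)
            for peer in candidates[: D - len(self.mesh)]:
                self.graft(peer)
        elif len(self.mesh) > D_HIGH:
            excess = sorted(self.mesh, key=lambda n: n.name)
            rng.shuffle(excess)
            for peer in excess[: len(self.mesh) - D]:
                self.prune(peer)


rng = random.Random(0)
nodes = [MeshNode(f"n{i:02d}") for i in range(20)]
first = nodes[0]
first.known = set(nodes[1:])

first.heartbeat(rng)              # under-connected: grafts up to D peers
size_after_graft = len(first.mesh)

for peer in nodes[1:]:
    first.graft(peer)             # simulate many inbound GRAFTs
first.heartbeat(rng)              # over-connected: prunes back down to D
size_after_prune = len(first.mesh)
```

Bounding the degree this way is what keeps bandwidth low compared with flooding: a message is pushed eagerly only to the handful of mesh peers rather than to every connected peer.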
17. ➔ Simulation
◆ 100 peers
◆ 5 msg/s
◆ Run for 2s
◆ Time expansion 10x
for visualisation
Gossipsub with
hardening extensions
18. ➔ Evaluation performed with over 10 000 peers
➔ Successfully demonstrated that Gossipsub is resilient to a variety of attacks
➔ The implementation of Gossipsub and the code used for the evaluation are all Open Source
➔ Paper published with the evaluation and architecture
Evaluation Report & Paper
19. ➔ Documents all known challenges & existing solutions
➔ Invitation for collaboration in solving the challenges at new lengths
Open Problem Statement
22. Goals
➔ No central coordinator to distribute
updates
➔ Ability to certify that the updates originate
from the author
Decentralized updates are a
requirement for several use cases
Challenges
➔ Different expectations of delivery depending on
the application
➔ Scalability (# of nodes) and frequency
(# of updates)
23. ➔ Self-Certified Naming
➔ Design
◆ Verifiability:
◆ Versatility: A name can point to one or more
immutable addresses
◆ Transport independent: Not tied to a specific
method of distribution.
➔ Names (aka Pointers) are referenced by the
PublicKey of the signer vs. the Content of the
name
IPNS, the
InterPlanetary NameSystem
➔ Several implementations:
◆ IPNS over DHT
◆ IPNS over PubSub
◆ IPNS over DNS
◆ IPNS over ETH Domains
◆ IPNS over Namecoin (one of the first
implementations)
◆ IPNS over QRCodes? NFC? BLE Beacons?
Multiple possibilities
24. ➔ Each name is a Public/Private Key Pair
➔ The Private Key is used to sign statements (aka records)
containing the state of the name
➔ The Public Key is used by clients for
◆ Discovering the new records published
◆ Certifying that the record update was indeed issued by the author
➔ Every time the owner of the name wants to issue an update, it
simply creates a new record, signs it and broadcasts it over
one of the selected transports (e.g. DHT, PubSub, DNS, etc.)
➔ Human Readable naming is enabled via a registrar. Examples:
◆ DNS some-domain.com -> TXT Record /ipns/Qm_CID of
Public Key.. -> /ipfs/Qm_CID of content
◆ UnstoppableDomains some-domain.crypto -> /ipns/Qm_CID
of Public Key.. -> /ipfs/Qm_CID of content
IPNS Architecture
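The publish-and-resolve flow described above can be sketched as follows. This is a toy model under loud assumptions, not the IPNS record format: `Record`, `sign_record`, `verify_record` and `Resolver` are hypothetical names, the `sequence` counter used to pick the newest record is an assumption of this sketch, and the signature is a stand-in built on HMAC so the example runs on the standard library alone. Real IPNS uses public-key signatures, where verifiers only need the public key and never hold the signing secret.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    value: str        # what the name points to, e.g. "/ipfs/<CID>"
    sequence: int     # assumed counter: higher means newer
    signature: bytes


def _payload(value, sequence):
    return f"{sequence}:{value}".encode()


def sign_record(key, value, sequence):
    # STAND-IN: HMAC replaces a real private-key signature in this sketch.
    sig = hmac.new(key, _payload(value, sequence), hashlib.sha256).digest()
    return Record(value, sequence, sig)


def verify_record(key, record):
    # STAND-IN: a real verifier would use the author's public key here.
    expected = hmac.new(
        key, _payload(record.value, record.sequence), hashlib.sha256
    ).digest()
    return hmac.compare_digest(expected, record.signature)


class Resolver:
    """Tracks the newest valid record seen for one name."""

    def __init__(self, key):
        self.key = key
        self.best = None

    def offer(self, record):
        """Accept a record from any transport; return True if it was adopted."""
        if not verify_record(self.key, record):
            return False  # not issued by the author
        if self.best is not None and record.sequence <= self.best.sequence:
            return False  # stale or replayed update
        self.best = record
        return True


key = b"demo-key"  # hypothetical key material
v1 = sign_record(key, "/ipfs/cid-v1", 1)   # placeholder paths, not real CIDs
v2 = sign_record(key, "/ipfs/cid-v2", 2)
forged = Record("/ipfs/evil", 3, b"\x00" * 32)

resolver = Resolver(key)
accepted_v2 = resolver.offer(v2)
accepted_v1 = resolver.offer(v1)          # arrives late, ignored
accepted_forged = resolver.offer(forged)  # bad signature, ignored
```

Because acceptance depends only on the record itself, the same check works regardless of whether the record arrived over the DHT, PubSub, DNS or any other transport.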
25. ➔ Documents all known challenges & existing solutions
➔ Invitation for collaboration in solving the challenges at new lengths
Open Problem Statements
27. ➔ “A distributed system is one in which the failure of a computer you
didn't even know existed can render your own computer unusable.”
Leslie Lamport
➔ Synchronization errors happen
◆ Simultaneous Writers
◆ Propagation Delay
➔ Distributed Consensus is not a trivial problem
➔ There must be a better way! 💡
Mutable Pointers are not enough
for Distributed Applications
28. Type Systems for
Distributed Apps
Motivation
Messaging Layer
Mutable Pointers
Type Systems for
Distributed Apps
Access Controls for
Distributed Apps
Mutable Content
Tools & Libraries
29. Goals
➔ No need for central coordinator and avoid
central points of failure and/or chokepoints
(e.g. case for Operational Transforms)
➔ Support multiple types of collaboration
Convergent state and/or
Consensus over Distributed Data
Challenges
➔ Avoid resource intensive and slow constructions
(e.g. PBFT, Nakamoto Consensus, etc)
➔ Support Web scale
➔ Fast sync, Snapshotting and Garbage Collection
30. ➔ Fairly new and large field of distributed
systems research
➔ Design
◆ Merge function is defined a priori and
distributed to all participants
◆ Independent of the order of the events received,
all participants will know how to merge the
events and converge on the same state
➔ CRDTs are not a one-size fits all data
structure, instead they are many data
structures that fit specific use cases:
◆ G-Set (Grow-only Set)
◆ 2P-Set (Two-Phase Set)
◆ LWW-Element-Set (Last-Write-Wins-Element-Set)
◆ OR-Set (Observed-Remove Set)
◆ Sequence/Ordered Set
◆ Many more..
Conflict-Free Replicated Data
Types (CRDTs)
➔ Pros
◆ Order of reception is not important, as long as the
events arrive
◆ Does not require any live interactivity with all
participants
◆ Can be used to build all sorts of applications, from
collaborative doc editors, chats and much more
➔ Cons
◆ Can generate a lot of non-trivially garbage collectable
objects for long running programs
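Two of the listed data types, the G-Set and the 2P-Set, are small enough to sketch directly. This is a minimal illustration with hypothetical class names (`GSet`, `TwoPSet`), showing that replicas converge to the same state whatever the order in which they merge, and that removed elements are kept as tombstones, which is the garbage-collection cost mentioned under Cons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GSet:
    """Grow-only set: elements can be added, never removed."""
    items: frozenset = frozenset()

    def add(self, x):
        return GSet(self.items | {x})

    def merge(self, other):
        # Set union is commutative, associative and idempotent.
        return GSet(self.items | other.items)


@dataclass(frozen=True)
class TwoPSet:
    """Two-phase set: one G-Set of additions, one G-Set of removals."""
    added: GSet = GSet()
    removed: GSet = GSet()

    def add(self, x):
        return TwoPSet(self.added.add(x), self.removed)

    def remove(self, x):
        # Only elements observed as added can be removed; the removal is
        # recorded as a tombstone and the element can never be re-added.
        if x not in self.added.items:
            return self
        return TwoPSet(self.added, self.removed.add(x))

    def merge(self, other):
        return TwoPSet(
            self.added.merge(other.added),
            self.removed.merge(other.removed),
        )

    def value(self):
        return self.added.items - self.removed.items


# Two replicas start from the same state and diverge.
base = TwoPSet().add("x").add("y")
replica_a = base.remove("x")
replica_b = base.add("z")

# Merging in either order yields the same state.
ab = replica_a.merge(replica_b)
ba = replica_b.merge(replica_a)
print(sorted(ab.value()))
```

The tombstone for `"x"` stays in `removed` forever, so long-running replicas accumulate state even though the visible value shrinks.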
32. Access Controls for
Distributed Apps
33. Goals
➔ Grant and Remove access to data
➔ No need for central coordinator and avoid
central points of failure and/or chokepoints
(e.g. case for Operational Transforms)
➔ Support multiple types of collaboration
Access Controls
Challenges
➔ Avoid resource intensive and slow constructions
(e.g. PBFT, Nakamoto Consensus, etc) to agree
on the ACL
➔ Support Web scale
➔ Fast sync, Snapshotting and Garbage Collection
34. ➔ Private collaborations over data in a Public Network
➔ Design
◆ Symmetric Keys give Read Access
● Only the ones with the key can decrypt and therefore
read
◆ Asymmetric Keys give Write Access
● Similar to how IPNS determines who is trusted to
update the Record
➔ Pros
◆ Fine-grained control of Read & Write Access
◆ Authorship is verifiable
◆ Does not require a centralized party for coordination
➔ Cons
◆ Requires nodes to have a way to synchronize on the latest
state of the ACL to ensure that only allowed nodes receive the
latest updates
Cryptographic ACLs
aka Capabilities Systems
37. ➔ Open Source Collaborative Text Editor
➔ IPNS to get the latest release of the App
➔ IPFS to load the contents
➔ Public Key Pair & Symmetric Key at the URL
➔ CRDT to create the shared state
➔ PubSub to distribute messages
➔ Features
◆ Real-time collaboration
◆ Conflict-Free
PeerPad as an example of the
vertical integration