This document discusses building a data platform in the cloud. It covers the evolution of data platforms from monolithic architectures to distributed event-driven architectures using a data lake. Key aspects of a cloud data platform include collecting and persisting all data in a data lake for standardized access, near real-time processing using streaming technologies, and building the platform using either fully managed or DIY/hybrid approaches on AWS. Design principles focus on event-driven separation of data producers and consumers and choosing the right technology for the problem.
Why you really want SQL in a Real-Time Enterprise Environment | VoltDB
The SQL/NoSQL/NewSQL discussion isn’t a new one, and it continues to be a confused and confusing debate inside many companies. In this webinar, VoltDB's Dennis Duckworth (Director of Product Marketing) and David Rolfe (Director of Solutions Engineering EMEA) discuss the importance and portability of a standard, expressive query language, and how ACID guarantees make concurrent operations much easier to get right. They also cover how SQL operational databases deliver correct answers, and why NoSQL can’t without significant cost (for example, in performance). Lastly, they outline how VoltDB customers dramatically reduced the number of servers needed when they switched from DIY open-source NoSQL stacks to VoltDB.
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
Data mesh is a relatively recent term that describes a set of principles that good modern data systems uphold. A kind of “microservices” for the data-centric world. While data mesh is not a technology-specific pattern, systems that adopt and implement its principles have a relatively long history under different guises.
In this talk, we share our recommendations and picks of what every developer should know about building a streaming data mesh with Kafka. We introduce the four principles of the data mesh: domain-driven decentralization, data as a product, self-service data platform, and federated governance. We then cover topics such as the differences between working with event streams versus centralized approaches and highlight the key characteristics that make streams a great fit for implementing a mesh, such as their ability to capture both real-time and historical data. We’ll examine how to onboard data from existing systems into a mesh, modelling the communication within the mesh, how to deal with changes to your domain’s “public” data, give examples of global standards for governance, and discuss the importance of taking a product-centric view on data sources and the data sets they share.
Eat Your Data and Have It Too: Get the Blazing Performance of In-Memory Opera... | VoltDB
This webinar, presented by VoltDB's Dennis Duckworth and David Rolfe, explains how an in-memory operational database can be used to deliver speed and performance for mobile, financial services, media & advertising, telco, retail, or IoT use cases, while providing all the durability and persistence you desire.
Using Druid for interactive count distinct queries at scale | Itai Yaffe
At NMC (Nielsen Marketing Cloud) we need to present to our clients the number of unique users who meet given criteria. The condition is typically a set-theoretic expression over a stream of events for a given time range. Historically, we used Elasticsearch to answer these types of questions; however, we encountered major scaling issues. In this presentation we detail the journey of researching, benchmarking and productionizing a new technology, Druid, with DataSketches, to overcome the limitations we were facing.
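The approximate count-distinct approach described above can be sketched with a minimal HyperLogLog. This is a toy illustration only: in production Druid relies on the Apache DataSketches library, not code like this, and the register count and hash function here are arbitrary choices.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog sketch for approximate count-distinct."""

    def __init__(self, p=12):
        self.p = p                       # 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m
        # bias-correction constant (valid for m >= 128)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 128-bit hash of the item
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)           # low p bits pick a register
        w = h >> self.p                  # remaining 128 - p bits
        # rank = position of the leftmost 1-bit in w (1-based)
        rank = (128 - self.p) - w.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        z = sum(2.0 ** -r for r in self.registers)
        e = self.alpha * self.m * self.m / z
        # small-range correction via linear counting
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            e = self.m * math.log(self.m / zeros)
        return int(e)
```

With p=12 (4096 registers, a few KB of state) the standard error is roughly 1.6%, which is why sketches scale where exact distinct counts over billions of events do not.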
Dr. Michael Stonebraker will share his "one-size-doesn’t-fit-all" perspective when it comes to picking the right tool for the job. He will explain the fast data stack, why traditional RDBMSs fall short, and how a modern in-memory, ACID-compliant SQL database with a scale-out architecture is the right choice for enabling fast data applications. Then John Hugg provides the “proof in the pudding” with a step-by-step review of his Unique Devices application, which performs real-time analytics on fast-moving data. It's a representative implementation of the speed layer in the Lambda Architecture, with the logic captured in just 30 lines of code.
Visualize some of Austin's open source data using Elasticsearch with Kibana. ObjectRocket's Steve Croce presented this talk on 10/13/17 at the DBaaS event in Austin, TX.
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ... | Confluent
(Bruno Simic, Solutions Engineer, Couchbase)
Breakout during Confluent’s streaming event in Munich. This three-day hands-on course focused on how to build, manage, and monitor clusters using industry best-practices developed by the world’s foremost Apache Kafka™ experts. The sessions focused on how Kafka and the Confluent Platform work, how their main subsystems interact, and how to set up, manage, monitor, and tune your cluster.
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at ... | Dataconomy Media
"Introduction to Kx Technology", James Corcoran, Head of Engineering EMEA at First Derivatives
About the Author:
James is Senior Vice President, Fast Data Solutions at Kx, where he has worked as a developer since 2009. In his career to date, he has worked in the algorithmic trading space at many of the world’s top financial institutions using Kx, a low-latency technology for analysing time-series data. He is a certified Professional Risk Manager and holds a master's in Quantitative Finance from University College Dublin. In recent years he has built systems for clients ranging from start-ups to blue-chip companies in data-intensive industries such as pharma, utilities and telco.
How to Build Real-Time Streaming Analytics with an In-memory, Scale-out SQL D... | VoltDB
In this webinar Ryan Betts, CTO at VoltDB, explains why streaming aggregation is a key to streaming analytics. He will also address how SQL can be used in combination with streaming aggregation, and the benefits of up-to-date analytics for per-event transactions and insights. You can listen to the webinar here: http://learn.voltdb.com/WRSQLDatabase.html
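The streaming-aggregation idea behind this talk can be sketched as a running aggregate over a time window, updated per event instead of recomputed per query. This is an illustrative stand-in only; a database like VoltDB would maintain such aggregates internally (e.g. via materialized views), not through application code like this.

```python
from collections import deque

class WindowedSum:
    """Per-event running sum over a fixed event-time window (seconds)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()    # (timestamp, value), oldest first
        self.total = 0

    def add(self, ts, value):
        """Ingest one event and return the up-to-date window aggregate."""
        self.events.append((ts, value))
        self.total += value
        # evict events that have fallen out of the window
        while self.events and self.events[0][0] <= ts - self.window:
            _, old = self.events.popleft()
            self.total -= old
        return self.total
```

Because the aggregate is maintained incrementally, each event costs O(1) amortized work, which is what makes per-event analytics feasible at streaming rates.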
Slides from my presentation on Lambda Architecture at Indix, presented at Fifth Elephant 2014.
It talks about our experience in using Lambda Architecture at Indix, to build a large scale analytics system on unstructured, dynamically changing data sources using Hadoop, HBase, Scalding, Spark and Solr.
Big data serving: Processing and inference at scale in real time | Itai Yaffe
Jon Bratseth (VP Architect) @ Verizon Media:
The big data world has mature technologies for offline analysis and learning from data, but has lacked options for making data-driven decisions in real time.
When it is sufficient to consider a single data point, model servers such as TensorFlow Serving can be used, but in many cases you want to consider many data points to make decisions.
This is a difficult engineering problem combining state, distributed algorithms and low latency, but solving it often makes it possible to create far superior solutions when applying machine learning.
This talk will explain why this is a hard problem, show the advantages of solving it, and introduce the open source Vespa.ai platform which is used to implement such solutions in some of the largest scale problems in the world including the world's third largest ad serving system.
Symantec: Cassandra Data Modelling techniques in action | DataStax Academy
Our product presents an aggregated view of metadata collected for billions of objects (files, emails, SharePoint objects, etc.). We use Cassandra to store those billions of objects along with an aggregated view of that metadata. Customers can analyse the corpus of data in real time by searching in a completely flexible way, i.e. they can get summary aggregates for many billions of objects and then drill down to items by filtering on various facets of the metadata. We achieve this using a combination of Cassandra and Elasticsearch. This presentation covers the various data modelling techniques we use to aggregate and further summarise all that metadata, and to search the summary in real time.
Add Historical Analysis of Operational Data with Easy Configurations in Fivet... | Databricks
Fivetran makes it easy to automate data ingestion, particularly for operational data sources such as Salesforce, Zendesk, and Oracle Eloqua, no matter how source schemas and APIs change. Achieving historical analysis is cumbersome, time-consuming, and costly to build and maintain manually. A common approach is to take snapshots, which only capture changes at a given point in time, and the additional storage requirements can become unwieldy to manage. Type 2 Slowly Changing Dimensions (SCD) let you track any change at any point in time. This session shows how Fivetran History Mode, which uses Type 2 SCD, can be configured and switched on with one click and synchronized for a desired time period. This accelerates time to insight, making it easy to automate both data ingestion and historical analysis.
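The Type 2 SCD mechanics referenced above can be illustrated with a minimal sketch: instead of overwriting a row, each change closes the current version and appends a new one with a validity range. This is a generic illustration, not Fivetran's implementation; the field names (`valid_from`, `valid_to`) are assumptions.

```python
def apply_scd2_change(history, key, new_attrs, change_time):
    """Type 2 SCD update: close the open row for `key`, append a new version."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = change_time      # close the current version
    history.append({"key": key, **new_attrs,
                    "valid_from": change_time, "valid_to": None})
    return history

def as_of(history, key, ts):
    """Point-in-time lookup: the version of `key` that was valid at `ts`."""
    for row in history:
        if (row["key"] == key and row["valid_from"] <= ts
                and (row["valid_to"] is None or ts < row["valid_to"])):
            return row
    return None
```

Because every version is retained with its validity interval, "what did this record look like last March?" becomes a simple range predicate instead of a reconstruction from snapshots.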
The role of databases in modern application development | MariaDB plc
The rise of serverless microservices, event-driven application architecture and full-stack development with JavaScript and the MEAN stack is changing what application developers need from databases – and how they interact with them. In this session, MariaDB's Thomas Boyd discusses recent advancements in application development and architecture and explains how MariaDB supports them.
Scylla Summit 2022: An Odyssey to ScyllaDB and Apache Kafka | ScyllaDB
Will LaForest is the Public Sector CTO for Confluent. In his current position, Will evangelizes how Apache Kafka, event-driven data-in-motion architecture, and open-source software are addressing mission challenges in government. He has spent 25 years wrangling data at massive scale. His technical career spans diverse areas, from software engineering, NoSQL, data science, cloud computing, and machine learning to building statistical visualization software, but it began with code slinging at DARPA as a teenager. Will holds degrees in mathematics and physics from the University of Virginia.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Our journey with Druid - from initial research to full production scale | Itai Yaffe
Here at the Nielsen Marketing Cloud we use druid.io (http://druid.io/) as one of our main data stores, both for simple counts and for approximate count-distinct (DataSketches).
It’s been more than a year since we started using it, ingesting billions of events each day into multiple Druid clusters for different use cases.
In this meet-up, we will share our journey, the challenges we had, the way we overcame them (at least most of them) and the steps we made to optimize the process around Druid to keep the solution cost effective.
Before diving into Druid, we will briefly present our data pipeline architecture, starting from the front-end serving system, deployed in a number of geo-locations, to a centralized Kafka cluster in the cloud, and give some examples of the different processes that consume from Kafka and feed our different data sources.
NoSQL no more: SQL on Druid with Apache Calcite | Gian Merlino
Druid is an analytics-focused, distributed, scale-out data store. Existing Druid clusters have scaled to petabytes of data and trillions of events, ingesting millions of events every second. Up until version 0.10, Druid could only be queried in a JSON-based language that many users found unfamiliar.
Enter Apache Calcite. It includes an industry-standard SQL parser, validator, and JDBC driver, as well as a cost-based relational optimizer. Calcite bills itself as “the foundation for your next high-performance database” and is used by Hive, Drill, and a variety of other projects. Druid uses Calcite to power Druid SQL, a standards-based query API that vaults Druid out of the NoSQL world and into the SQL world.
Gian Merlino offers an overview of Druid SQL and explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects.
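To make the native-versus-SQL contrast concrete, here is a hedged sketch of the same hourly event count expressed both ways. The datasource name, interval, and output names are made-up placeholders; the native query shape follows Druid's documented JSON timeseries format.

```python
import json

# Native Druid query: a JSON object posted to the broker.
# "events" and the interval below are hypothetical placeholders.
native_query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "hour",
    "aggregations": [{"type": "count", "name": "cnt"}],
    "intervals": ["2023-01-01/2023-01-02"],
}

# Druid SQL (available since 0.10): the same question in standard SQL,
# which Calcite parses, validates, and plans into a native query.
sql_query = """
SELECT FLOOR(__time TO HOUR) AS hr, COUNT(*) AS cnt
FROM events
WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-01-02'
GROUP BY 1
"""

print(json.dumps(native_query, indent=2))
```

The SQL form is what lets standard tooling (JDBC clients, BI dashboards) talk to Druid without learning the JSON dialect.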
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014 | Jaroslav Gergic
The recent boom in big data processing and the democratization of the big data space have been enabled by the fact that most of the concepts that originated in the research labs of companies such as Google, Amazon, Yahoo and Facebook are now available as open source. Technologies such as Hadoop and Cassandra let businesses around the world become more data-driven and tap into their massive data feeds to mine valuable insights.
At the same time, we are still at a certain stage of the maturity curve of these new big data technologies and of the entire big data technology stack. Many of the technologies originated from a particular use case and attempts to apply them in a more generic fashion are hitting the limits of their technological foundations. In some areas, there are several competing technologies for the same set of use cases, which increases risks and costs of big data implementations.
We will show how GoodData solves the entire big data pipeline today, starting from raw data feeds all the way up to actionable business insights. All this is provided as a hosted multi-tenant environment, letting its customers solve a particular analytical use case, or many analytical use cases for thousands of their own customers, all using the same platform and tools while being abstracted away from the technological details of the big data stack.
Big Data in 200 km/h | AWS Big Data Demystified #1.3 | Omid Vahdaty
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.
How to transform data (TXT, CSV, TSV, JSON) into Parquet or ORC?
Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
How to handle streaming?
How to manage costs?
Performance tips?
Security tips?
Cloud best practices?
Some of our online materials:
Website:
https://big-data-demystified.ninja/
Youtube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
AWS Big Data Demystified #1: Big data architecture lessons learned | Omid Vahdaty
AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of the big data technologies that were selected or disregarded at our company.
The video: https://youtu.be/l5KmaZNQxaU
Don't forget to subscribe to the YouTube channel.
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup : https://www.meetup.com/AWS-Big-Data-Demystified/
The facebook group : https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
Simply Business is a leading insurance provider for small businesses in the UK, and we are now expanding to the USA. In this presentation, I explain how our data platform is evolving to keep delivering value and adapting to a company that changes really fast.
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned | Omid Vahdaty
A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS & GCP and Data Center infrastructure to answer the basic questions of anyone starting their way in the big data world.
How to transform data (TXT, CSV, TSV, JSON) into Parquet, ORC or AVRO?
Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL? GCS? BigQuery? Dataflow? Datalab? TensorFlow?
How to handle streaming?
How to manage costs?
Performance tips?
Security tips?
Cloud best practices?
In this meetup we shall present lecturers working on several cloud vendors, various big data platforms such hadoop, Data warehourses , startups working on big data products. basically - if it is related to big data - this is THE meetup.
Some of our online materials (mixed content from several cloud vendor):
Website:
https://big-data-demystified.ninja (under construction)
Meetups:
https://www.meetup.com/Big-Data-Demystified
https://www.meetup.com/AWS-Big-Data-Demystified/
You tube channels:
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
Audience:
Data Engineers
Data Science
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future.
Learn more about the tools, techniques and technologies for working productively with data at any scale. This presentation introduces the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
Jon Einkauf, Senior Product Manager, Elastic MapReduce, AWS
Alan Priestley, Marketing Manager, Intel and Bob Harris, CTO, Channel 4
Today, data lakes are widely used and have become extremely affordable as data volumes have grown. However, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake? The value lies in the compute engine that runs on top of a data lake.
Join us for this webinar where Ahana co-founder and Chief Product Officer Dipti Borkar will discuss how to unlock the value of your data lake with the emerging Open Data Lake analytics architecture.
Dipti will cover:
-Open Data Lake analytics - what it is and what use cases it supports
-Why companies are moving to an open data lake analytics approach
-Why the open source data lake query engine Presto is critical to this approach
Weathering the Data Storm – How SnapLogic and AWS Deliver Analytics in the Cl...SnapLogic
In this webinar, learn how SnapLogic and Amazon Web Services helped Earth Networks create a responsive, self-service cloud for data integration, preparation and analytics.
We also discuss how Earth Networks gained faster data insights using SnapLogic’s Amazon Redshift data integration and other connectors to quickly integrate, transfer and analyze data from multiple applications.
To learn more, visit: www.snaplogic.com/redshift
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Hivelance Technology
Cryptocurrency trading bots are computer programs designed to automate buying, selling, and managing cryptocurrency transactions. These bots utilize advanced algorithms and machine learning techniques to analyze market data, identify trading opportunities, and execute trades on behalf of their users. By automating the decision-making process, crypto trading bots can react to market changes faster than human traders
Hivelance, a leading provider of cryptocurrency trading bot development services, stands out as the premier choice for crypto traders and developers. Hivelance boasts a team of seasoned cryptocurrency experts and software engineers who deeply understand the crypto market and the latest trends in automated trading, Hivelance leverages the latest technologies and tools in the industry, including advanced AI and machine learning algorithms, to create highly efficient and adaptable crypto trading bots
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntekBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Why React Native as a Strategic Advantage for Startup Innovation.pdfayushiqss
Do you know that React Native is being increasingly adopted by startups as well as big companies in the mobile app development industry? Big names like Facebook, Instagram, and Pinterest have already integrated this robust open-source framework.
In fact, according to a report by Statista, the number of React Native developers has been steadily increasing over the years, reaching an estimated 1.9 million by the end of 2024. This means that the demand for this framework in the job market has been growing making it a valuable skill.
But what makes React Native so popular for mobile application development? It offers excellent cross-platform capabilities among other benefits. This way, with React Native, developers can write code once and run it on both iOS and Android devices thus saving time and resources leading to shorter development cycles hence faster time-to-market for your app.
Let’s take the example of a startup, which wanted to release their app on both iOS and Android at once. Through the use of React Native they managed to create an app and bring it into the market within a very short period. This helped them gain an advantage over their competitors because they had access to a large user base who were able to generate revenue quickly for them.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Modern design is crucial in today's digital environment, and this is especially true for SharePoint intranets. The design of these digital hubs is critical to user engagement and productivity enhancement. They are the cornerstone of internal collaboration and interaction within enterprises.
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?XfilesPro
Worried about document security while sharing them in Salesforce? Fret no more! Here are the top-notch security standards XfilesPro upholds to ensure strong security for your Salesforce documents while sharing with internal or external people.
To learn more, read the blog: https://www.xfilespro.com/how-does-xfilespro-make-document-sharing-secure-and-seamless-in-salesforce/
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
3. Agenda
● The evolution of a data platform
● Data platform design principles
● Data platform technologies
● Data platform in the cloud
○ Data Lake - How to build
○ Data Lake - Technology selection
○ Data Propagation and Near real time processing - How to build
4. The Data Platform
● A unified platform for collecting, accessing and processing ALL of NI data
○ Collection - collect and persist
○ Standardization - consistent business data
○ Access - standardized, optimized, ad-hoc, applicative
● All in a stable, flexible, monitored, fast and cost-effective data platform
● Making all of the company's business-related data available quickly, for easy consumption, for creating insights and driving the business forward.
5. “You have to be careful if you don’t know where you are going, because you might not get there!” - Yogi Berra

Data Platform Evolution
“Technology always develops from the primitive, via the complicated, to the simple.” - Antoine de Saint-Exupéry
7. Data Platform Evolution - The monolith grows
● A bigger monolith with a DB
Issues:
● Deployments start to slow down
8. Data Platform Evolution
● A bigger monolith with a DB
● With some data tool
Issues:
● Dependency between monolith and data service
● Monolith refactoring
9. Data Platform Evolution
● A bigger monolith with a DB
● With some data tool
Issues:
● Dependency between monolith and data service
● Changes in the monolith break the data tools
● Data tools impact performance
10. Data Platform Evolution
● A distributed monolith with a DB
● With more data tools
Issues:
● Changes in the DB schema break the data tools
● Data tools impact performance
● The new services lock each other
● Monolith refactoring
11. Data Platform Evolution
● A distributed monolith with a DB
● Data tools read from a replica
Issues:
● Changes in the DB schema break the data tools
● Replica fails
● The new services lock each other
● Monolith refactoring
● Data freshness
12. Data Platform Evolution
● A distributed monolith with a DB
● ETL
● With more data tools
Issues:
● Changes in the DB schema break the data tools
● The new services lock each other
● Monolith refactoring
● Data freshness
13. Data Platform Evolution
● Microservices
● Monolith DB + replica
● With more data tools
● Data warehouse
Issues:
● Changes in the DB schema break the ETL (breaks all the time)
● Getting data from microservices
● Data warehouse flexibility + performance
● Data freshness
14. Data Platform Evolution
● Application events
● Event Bus
● ETL
● Data Warehouse
● More data tools
Issues:
● Data warehouse flexibility + performance
● Events consistency
● Data freshness
15. Data Platform Evolution
● Application events
● Event Bus
● ETL
● Data Lake
○ Metastore
○ Processing engines
○ Data stores
○ SQL access
● Any data application
Issues:
● Events consistency
● Data freshness
16. Data Platform Evolution
● Application events
● Event Bus
● Near Real Time
● ETL
● Data Lake
○ Metastore
○ Processing engines
○ Data stores
○ SQL access
● Any data application
Issues:
● Events consistency
17. “Any problem in Computer Science can be solved with another level of indirection.”
– David Wheeler
“Except the problem of indirection complexity.”
– Bob Morgan
The base principle used in the data platform evolution...
18. Data Platform - Design Principles
● Event-driven separation between producers and consumers of data
● Use the technology that suits the problem
● Near real time access to all data
● Data Lake
○ All data goes to the data lake
○ The data lake exposes data as the main flow of data
○ SQL/API/file access
○ Data is immutable
○ The data lake is the “source of truth”: no other DB!
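To make the "immutable, partitioned data lake" principle concrete, here is a small Python sketch of how an event could be mapped to a date-partitioned object-store key. The topic name, partition scheme and Parquet suffix are illustrative assumptions, not from the deck:

```python
from datetime import datetime

def lake_key(topic: str, event_time: str, event_id: str) -> str:
    """Build an immutable, date-partitioned object-store key for an event.

    Date partitioning keeps scans cheap for engines such as Presto/Athena;
    the unique event id means objects are only appended, never updated.
    """
    ts = datetime.strptime(event_time, "%Y-%m-%dT%H:%M:%S.%fZ")
    return (f"{topic}/year={ts.year}/month={ts.month:02d}/"
            f"day={ts.day:02d}/{event_id}.parquet")

print(lake_key("orders", "2017-09-07T07:17:31.503Z", "abc-123"))
# orders/year=2017/month=09/day=07/abc-123.parquet
```

Because keys are derived from event time and a unique id, reprocessing never overwrites history, which is what makes the lake a usable "source of truth".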
20. Data Platform Facets
● Data Propagation
○ Event bus and event structuring
● Data Persistence
○ Durability, partitioning and formatting
● Data Access
○ Allow users/applications access to data in any SLA needed
● Data Standardization
○ Unified business data
● Data Processing
○ ETLs, algorithms and app-processing infra
22. Data Lake - Core Parts
● Scalable object store
● Data digest ETLs
● Data format and partitioning
● A metastore/dictionary
● Processing engines
● Data Lake APIs
○ SQL accessible
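The metastore/dictionary part can be illustrated with a toy registry that maps a logical table name to its location, format and partition columns, roughly the role Hive Metastore or AWS Glue plays in a real lake. The names and fields below are hypothetical:

```python
# A toy metastore: maps a logical table name to its storage location,
# file format, and partition columns. Engines consult this instead of
# hard-coding paths, which is what makes the lake SQL-accessible.
metastore = {}

def register_table(name, location, fmt, partitions):
    metastore[name] = {"location": location, "format": fmt,
                       "partitions": list(partitions)}

def describe(name):
    entry = metastore[name]
    return (f"{name} -> {entry['format']} at {entry['location']} "
            f"partitioned by {', '.join(entry['partitions'])}")

register_table("orders", "s3://lake/orders/", "parquet",
               ["year", "month", "day"])
print(describe("orders"))
# orders -> parquet at s3://lake/orders/ partitioned by year, month, day
```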
23. Data Lake - Technologies - DIY
● HDFS
● Hive Metastore
● Processing
○ Spark
○ Tez
○ M/R
● Data Access
○ Spark SQL
○ Impala
○ Presto
● Parquet formatting
Distributions: Cloudera, HortonWorks, MapR
25. Data Lake - Technologies - AWS - DIY Hybrid
● S3
● Spark on EMR
○ ETL and processing
● Athena
● Redshift & Spectrum
● AWS Glue Metastore
● Parquet
26. Cloud Data Lake - DIY vs. AWS vs. ...

             | AWS                     | Open - DIY                 | DIY Hybrid                   | Fully Managed           | Proprietary
Features     | 7                       | 10                         | 10                           | 8                       | 8
Scalability  | 9                       | 10                         | 10                           | 9                       | 9
Operation    | Easy                    | Hard                       | Medium                       | Easy                    | Easy
Availability | 10                      | 9-10                       | 9-10                         | 9-10                    | 6
Flexibility  | 7                       | 10                         | 10                           | 7                       | 6
Dev effort   | Medium                  | Hard                       | Medium                       | Medium                  | Easy
Testability  | 7                       | 10                         | 8                            | 8                       | 4
Cost         | Start - Low, Run - High | Start - High, Run - Medium | Start - Medium, Run - Medium | Start - Low, Run - High | Start - Low, Run - High
Vendor Lock  | High                    | None                       | Low                          | Low                     | Damn

Acronym: DVOF-FACTS :)
28. Data Propagation
● Event structure and format
○ JSON, Avro, Protobuf...
● Event bus
○ Event-based flow of information between the systems
○ Integration with external systems using the events
○ Decouples data construction from data consumption
○ Kinesis/Firehose
○ Kafka/Confluent
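The decoupling the event bus provides can be sketched as a minimal in-memory pub/sub in Python. This is an illustrative stand-in for Kafka or Kinesis, not their API: producers publish to a topic without knowing who consumes it, and consumers subscribe without knowing who produces:

```python
from collections import defaultdict

class EventBus:
    """Minimal in-memory event bus sketch: producers and consumers
    only share a topic name, never a direct dependency."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Fan the event out to every registered consumer of the topic.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
seen = []
bus.subscribe("orders", seen.append)            # e.g. the ETL into the lake
bus.subscribe("orders", lambda e: None)         # e.g. a real-time dashboard
bus.publish("orders", {"id": "1", "total": 42})
print(seen)  # [{'id': '1', 'total': 42}]
```

Adding a new consumer is just another `subscribe` call; the producer code never changes, which is the core of the producer/consumer separation principle.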
29. Event Structure
A single event is composed of an event header, a platform header, other optional headers, and the event-specific data:

{
  "event_header": {
    "id": "{guid}",
    "event_type": "{map the schema}",
    "action": "publish",
    "schema_version": "{schema evolution}",
    "event_time": "2017-09-07T07:17:31.503Z"
  },
  "platform_header": {
    "platform": "{system}",
    "service": "{service name}"
  },
  "some_header": {
    "from": "2017-04-01",
    "to": "2017-04-01",
    "someType": "bla"
  },
  "data": {
    // all other specific fields of the event
    ...
  }
}
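A helper that assembles this envelope might look like the following Python sketch. The field names follow the slide; the helper itself and the example values are hypothetical:

```python
import json
import uuid
from datetime import datetime, timezone

def make_event(platform, service, event_type, data, schema_version="1"):
    """Wrap event-specific fields in the envelope from the slide:
    a platform header, an event header (id, type, action, schema
    version, time), and a 'data' section for the payload."""
    event_time = (datetime.now(timezone.utc)
                  .strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z")
    return {
        "platform_header": {"platform": platform, "service": service},
        "event_header": {
            "id": str(uuid.uuid4()),            # unique per event
            "event_type": event_type,           # maps to a schema
            "action": "publish",
            "schema_version": schema_version,   # enables schema evolution
            "event_time": event_time,
        },
        "data": data,
    }

event = make_event("billing", "invoice-service", "invoice_created",
                   {"invoice_id": "inv-7", "amount": 99.9})
print(json.dumps(event, indent=2))
```

Keeping the envelope identical across all producers is what lets downstream ETLs route and version events without understanding each payload.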
30. Near Real Time - Core Parts
● Event Bus
● Stream processing engines
● NoSQL DBs
31. Near Real Time - DIY
● Amazon
○ Kinesis Firehose - writes to S3/Redshift warehouse
○ Kinesis Analytics
○ DynamoDB
● Stream processing engines
○ Spark Streaming
○ Flink
○ Confluent Kafka
○ Kinesis Streams
○ ...
● Proprietary NoSQL DBs
○ MemSQL
○ Snowflake
○ Couchbase
○ Aerospike
○ Cassandra
○ Elastic
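What these stream processing engines do can be illustrated with a pure-Python tumbling-window count, the kind of continuous aggregation a Spark Streaming, Flink or Kinesis Analytics job runs at scale. This is a simplified sketch; real engines also handle event time, state and fault tolerance:

```python
from collections import Counter

def tumbling_counts(events, window_secs=60):
    """Count events per (window, key) over fixed, non-overlapping
    time windows. Events are (epoch_seconds, key) tuples."""
    counts = Counter()
    for ts, key in events:
        window_start = ts - (ts % window_secs)  # bucket by window start
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "click"), (30, "click"), (61, "click"), (65, "view")]
print(tumbling_counts(events))
# {(0, 'click'): 2, (60, 'click'): 1, (60, 'view'): 1}
```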
32. Near Real Time - AWS
● Data propagation
○ Kinesis Firehose - writes to S3/Redshift warehouse
○ DynamoDB
○ Redshift
● Stream processing engines
○ Kinesis Analytics
○ ...
● NoSQL DBs
○ Managed Elastic
○ DynamoDB
33. Near Real Time - AWS - DIY Hybrid
● Data propagation
○ Confluent Kafka
● Stream processing engines
○ EMR + Spark Streaming
○ EMR + Flink
● NoSQL DBs
○ Managed Elastic
○ DynamoDB
○ MemSQL
○ Snowflake
○ Couchbase
○ Aerospike
○ Cassandra
34. Near Real Time - DIY vs. AWS vs. ...

             | AWS                     | Open - DIY                 | DIY Hybrid                   | Fully Managed           | Proprietary
Features     | 5                       | 10                         | 10                           | 9                       | ?
Scalability  | 8                       | 10                         | 10                           | 9                       | 8
Operation    | Easy                    | Hard                       | Medium                       | Easy                    | Easy
Availability | 10                      | 9-10                       | 9-10                         | 9-10                    | 6
Flexibility  | 6                       | 10                         | 10                           | 9                       | 6
Dev effort   | Medium                  | Hard                       | Medium                       | Medium                  | Easy
Testability  | 7                       | 10                         | 10                           | 9                       | 4
Cost         | Start - Low, Run - High | Start - High, Run - Medium | Start - Medium, Run - Medium | Start - Low, Run - High | Start - Low, Run - High
Vendor Lock  | High                    | None                       | Low                          | Low                     | Damn
35. Bottom Line
● A data platform in the cloud is the same as a private data platform, but with the option of using managed solutions!
● Structure your data from your producers - remember: garbage in, garbage out!
● Pick the right technology for your problem!
● Choose your solution using these aspects:
○ Dev effort
○ Vendor locking
○ Operation effort
○ Flexibility
○ Features
○ Availability
○ Cost
○ Testability
○ Scalability
Acronym: DVOF-FACTS :)