The document provides an agenda and slides for a presentation on architectural considerations for data warehousing with Hadoop. The presentation discusses typical data warehouse architectures and challenges, how Hadoop can complement existing architectures, and provides an example use case of implementing a data warehouse with Hadoop using the Movielens dataset. Key aspects covered include ingestion of data from various sources using tools like Flume and Sqoop, data modeling and storage formats in Hadoop, processing the data using tools like Hive and Spark, and exporting results to a data warehouse.
Observability for Data Pipelines With OpenLineage - Databricks
Data is increasingly becoming core to many products, whether it is providing recommendations for users, gaining insights into how they use the product, or using machine learning to improve the experience. This creates a critical need for reliable data operations and for understanding how data flows through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.
Collecting lineage metadata while data pipelines are running provides an understanding of the dependencies between the many teams consuming and producing data, and of how constant changes impact them. It is the underlying foundation that enables many use cases related to data operations. The OpenLineage project is an API that standardizes this metadata across the ecosystem, reducing the complexity and duplicate work of collecting lineage information. It enables many projects that consume lineage in the ecosystem, whether they focus on operations, governance, or security.
Marquez is an open source project, part of the LF AI & Data Foundation, that instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making dependencies across organizations and technologies visible as they change over time.
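As a concrete illustration of the metadata being standardized, here is a minimal sketch of an OpenLineage-style run event built with only the standard library. The top-level field names follow the OpenLineage specification, but the job, namespace, and dataset names are hypothetical, and a real integration would emit these events through an OpenLineage client rather than by hand.

```python
import json
import uuid
from datetime import datetime, timezone

def make_run_event(event_type, job_name, run_id, inputs=(), outputs=()):
    """Build a minimal OpenLineage-style run event as a plain dict.

    Top-level fields (eventType, eventTime, run, job, inputs, outputs,
    producer) follow the OpenLineage spec; the namespaces and names
    below are illustrative only.
    """
    return {
        "eventType": event_type,  # START, COMPLETE, FAIL, ...
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": "my-pipeline", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
        "producer": "https://example.com/my-scheduler",
    }

# A scheduler emits a START event when the job begins and a COMPLETE
# event (sharing the same runId) when it finishes.
run_id = str(uuid.uuid4())
start = make_run_event("START", "daily_aggregation", run_id, inputs=["raw.events"])
done = make_run_event("COMPLETE", "daily_aggregation", run_id, outputs=["agg.daily"])

# Consumers such as Marquez receive these events over HTTP and stitch
# START/COMPLETE pairs into a lineage graph keyed by runId.
print(json.dumps(start, indent=2))
```

The shared `runId` is what lets a lineage consumer correlate the two events into one run of the job.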
As organizations pursue Big Data initiatives to capture new opportunities for data-driven insights, data governance has become table stakes, both for external regulatory compliance and for extracting business value internally within an enterprise. This session will introduce Apache Atlas, a project that was incubated by Hortonworks along with a group of industry leaders across several verticals, including financial services, healthcare, pharma, oil and gas, retail and insurance, to help address data governance and metadata needs with an open, extensible platform governed under the aegis of the Apache Software Foundation. Apache Atlas empowers organizations to harvest metadata across the data ecosystem and to govern and curate data lakes by applying consistent data classification with a centralized metadata catalog.
In this talk, we will present the underpinnings of the architecture of Apache Atlas and conclude with a tour of its governance capabilities, showcasing features for open metadata modeling, data classification, and visualizing cross-component lineage and impact. We will also demo how Apache Atlas delivers a complete view of data movement across several analytic engines, such as Apache Hive, Apache Storm and Apache Kafka, along with capabilities to effectively classify and discover datasets.
Mario Molina, Software Engineer
Change data capture (CDC) systems are used to identify changes in data sources and to capture and replicate those changes to other systems. Companies use CDC to sync data across systems, migrate to the cloud, or even apply stream processing, among other uses.
In this presentation we’ll see CDC patterns, learn how to use CDC with Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
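The CDC pattern described above can be sketched with a toy change-event envelope. The `before`/`after`/`op`/`ts_ms` shape mirrors the style popularized by Debezium connectors for Kafka, but the table, field names, and values here are illustrative only, and the in-memory "replica" stands in for whatever downstream system consumes the change stream.

```python
# A Debezium-style change event envelope: each captured row change carries
# the row state before and after the change, the operation type, and a
# timestamp. The topic, table, and values here are illustrative only.
change_event = {
    "op": "u",  # c = create, u = update, d = delete
    "ts_ms": 1620000000000,
    "before": {"id": 42, "email": "old@example.com"},
    "after": {"id": 42, "email": "new@example.com"},
    "source": {"table": "customers", "db": "shop"},
}

def apply_change(state, event):
    """Replay one change event against an in-memory replica keyed by row id."""
    row = event["before"] or event["after"]
    key = (event["source"]["table"], row["id"])
    if event["op"] == "d":
        state.pop(key, None)  # delete removes the row
    else:
        state[key] = event["after"]  # create/update upserts the new state
    return state

replica = {("customers", 42): {"id": 42, "email": "old@example.com"}}
apply_change(replica, change_event)
print(replica[("customers", 42)]["email"])  # new@example.com
```

In a real deployment these envelopes flow through a Kafka topic and a sink connector applies them, but the replay logic is conceptually the same fold over ordered events.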
MeetUp Monitoring with Prometheus and Grafana (September 2018) - Lucas Jellema
This presentation introduces the concept of monitoring - focusing on why and how, and finally on the tools to use. It introduces Prometheus (metrics gathering, processing, alerting), application instrumentation and Prometheus exporters, and finally it introduces Grafana as a common companion for dashboarding, alerting and notifications. This presentation also introduces the hands-on workshop, for which materials are available from https://github.com/lucasjellema/monitoring-workshop-prometheus-grafana
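The instrumentation and exporter ideas mentioned above boil down to serving metrics in Prometheus's text exposition format. The sketch below renders that format with only the standard library; the metric names are made up for illustration, and a real application would normally use an official client library such as `prometheus_client` instead of hand-rolling the text.

```python
# A counter and a gauge rendered in the Prometheus text exposition format.
# An exporter is essentially an HTTP endpoint (conventionally /metrics)
# returning lines like these on each scrape.

def render_metrics(metrics):
    """Render (name, help, type, value, labels) tuples as exposition text."""
    lines = []
    for name, help_text, mtype, value, labels in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics([
    ("app_requests_total", "Total HTTP requests.", "counter", 1027,
     {"method": "get", "code": "200"}),
    ("app_queue_depth", "Jobs currently queued.", "gauge", 7, {}),
])
print(body)
```

Prometheus scrapes this text on an interval, stores the samples as time series keyed by name and labels, and Grafana queries those series for dashboards.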
Iceberg: A modern table format for big data (Strata NY 2018) - Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout addressing the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
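The first three features in the list above can be illustrated with a toy model (this is not the real Iceberg library or file layout): table state is a chain of immutable snapshots, each one naming the complete set of live data files, so a reader pins a snapshot and is isolated from later commits, and query planning never needs a directory listing.

```python
# Toy model of snapshot-based table metadata (not the real Iceberg API).
# Each commit appends a new immutable snapshot listing every live data
# file, so readers get snapshot isolation without locks and planners
# never have to list directories.

class ToyTable:
    def __init__(self):
        self.snapshots = [frozenset()]  # snapshot 0: empty table

    def commit(self, add=(), remove=()):
        """Atomically add/remove files by appending a new snapshot."""
        current = self.snapshots[-1]
        new = (current - frozenset(remove)) | frozenset(add)
        self.snapshots.append(new)
        return len(self.snapshots) - 1  # id of the new snapshot

    def scan(self, snapshot_id=None):
        """Plan a read against a pinned snapshot (default: latest)."""
        sid = len(self.snapshots) - 1 if snapshot_id is None else snapshot_id
        return sorted(self.snapshots[sid])

t = ToyTable()
s1 = t.commit(add=["a.parquet", "b.parquet"])
reader_view = t.scan(s1)  # a reader pins snapshot 1
t.commit(add=["c.parquet"], remove=["a.parquet"])  # concurrent rewrite
print(reader_view)  # unchanged: ['a.parquet', 'b.parquet']
print(t.scan())     # latest:    ['b.parquet', 'c.parquet']
```

The real format adds manifest files, schema and partition-spec evolution, and an atomic pointer swap for commits, but the core idea is this immutable-snapshot chain.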
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy... - Databricks
Building data products requires a Lambda Architecture to bridge batch and streaming processing. AirStream is a framework built on top of Apache Spark that allows users to easily build data products at Airbnb. It has proven that Spark is impactful and useful in production for mission-critical data products.
On the streaming side, hear how AirStream integrates Spark Streaming with multiple systems, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases and share several best practices on how to manage Spark jobs in production.
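The "same computation logic on both sides" idea can be sketched in miniature (this is not AirStream's actual API): the aggregation is written once and applied both to a full historical batch and incrementally to micro-batches of a stream.

```python
# Toy illustration of sharing one computation between batch and streaming
# (not AirStream's actual API): the aggregation logic is defined once.

def count_by_key(pairs, totals=None):
    """Fold (key, value) pairs into running totals. Works over a full
    batch in one pass, or over each micro-batch of a stream by threading
    the previous totals back in."""
    totals = dict(totals or {})
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

# Batch mode: one pass over historical records.
batch = [("bookings", 3), ("views", 10), ("bookings", 2)]
batch_result = count_by_key(batch)

# Streaming mode: the same function folds each micro-batch into state.
stream_result = {}
for micro_batch in ([("views", 4)], [("bookings", 1), ("views", 1)]):
    stream_result = count_by_key(micro_batch, stream_result)

print(batch_result)   # {'bookings': 5, 'views': 10}
print(stream_result)  # {'views': 5, 'bookings': 1}
```

In Spark terms the batch path would run over Hive/S3 data and the streaming path inside a Spark Streaming job, but the business logic stays identical.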
Presto on Apache Spark: A Tale of Two Computation Engines - Databricks
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-the-art low-latency evaluation with Spark’s robust and fault-tolerant execution engine.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
Learn how Apache Atlas is being enhanced to provide a universal open metadata and governance platform for all data processing across the enterprise. With open metadata, multiple metadata repositories, potentially from different vendors, can operate collaboratively to create an enterprise catalog of data that can be located, understood, used and governed. In this talk we will provide a detailed description of the extensions to the type system, new APIs, the connector framework, the metadata discovery framework, the governance action framework and the interoperability that we are adding to Apache Atlas. We will show examples of these features in operation. For example, (1) how metadata is discovered and gathered into Apache Atlas, (2) how applications and tools access metadata, (3) how enforcement engines such as Apache Ranger keep synchronized with the latest governance requirements and (4) how to build an adapter that allows other vendors' metadata repositories to exchange metadata with Apache Atlas repositories. We will also explain how these features can be deployed together to support the Hadoop platform, and the enterprise beyond. This session will be presented by Nigel Jones - IBM & Ferd Schapers - ING Chief Information Architect
Speaker:
Nigel Jones, Software Architect, IBM Analytics Group, IBM
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017 - AWS Chicago
"Strategies for supporting near real time analytics, OLAP, and interactive data exploration" - Dr. Jeremy Engle, Engineering Manager Data Team at Jellyvision
Making Apache Spark Better with Delta Lake - Databricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
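The transaction-protocol bullet above can be illustrated with a toy model (this simplifies the real Delta Lake protocol considerably): each commit is a numbered JSON entry of add/remove actions in a log, and the table's current files are whatever survives a replay of the log, which is also what makes time travel possible.

```python
import json

# Toy model of a Delta-style transaction log (illustrative only, not the
# real protocol): commit N is a JSON entry of "add"/"remove" actions, and
# the current table state is obtained by replaying commits 0..N in order.

log = []  # stands in for the _delta_log/ directory of commit files

def commit(actions):
    """Append one atomic commit to the log; returns its version number."""
    log.append(json.dumps(actions))
    return len(log) - 1

def table_files(version=None):
    """Replay the log up to `version` to get the live data files."""
    upto = len(log) if version is None else version + 1
    files = set()
    for entry in log[:upto]:
        for action in json.loads(entry):
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return sorted(files)

commit([{"add": "part-0.parquet"}, {"add": "part-1.parquet"}])
commit([{"remove": "part-0.parquet"}, {"add": "part-2.parquet"}])  # rewrite

print(table_files())           # ['part-1.parquet', 'part-2.parquet']
print(table_files(version=0))  # time travel: ['part-0.parquet', 'part-1.parquet']
```

The real protocol adds checkpoints, metadata and protocol actions, and optimistic concurrency control on the commit, but replay-the-log is the core mechanism behind ACID and time travel.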
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
A Thorough Comparison of Delta Lake, Iceberg and Hudi - Databricks
Recently, a set of modern table formats, such as Delta Lake, Hudi and Iceberg, has sprung up. Along with the Hive Metastore, these table formats are trying to solve long-standing problems in traditional data lakes with features like ACID transactions, schema evolution, upserts, time travel and incremental consumption.
Apache Iceberg Presentation for the St. Louis Big Data IDEA - Adam Doyle
Presentation on Apache Iceberg for the February 2021 St. Louis Big Data IDEA. Apache Iceberg is an open table format that works with Hive and Spark.
In this session, we will start with the importance of monitoring services and infrastructure. We will discuss Prometheus, an open-source monitoring tool, and its architecture, as well as some visualization tools that can be used with Prometheus. Then we will have a quick demo of Prometheus and Grafana.
August Pittsburgh Hadoop User Group meetup ( http://www.meetup.com/HUG-Pittsburgh/events/195143712/ ) where we discuss Apache Drill and how it can provide agility, flexibility, and speed to both structured and unstructured data analytics, on Hadoop and otherwise.
Overview of Apache Trafodion (incubating), an enterprise-class transactional SQL-on-Hadoop DBMS, covering operational use cases, what it takes to be a world-class RDBMS, some performance information, and the new company Esgyn, which will leverage Apache Trafodion for operational solutions.
"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production - Codemotion
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples covered in this presentation.
Data ingest is a deceptively hard problem. In the world of big data processing, it becomes exponentially more difficult. It's not sufficient to simply land data on a system; that data must be ready for processing and analysis. The Kite SDK is a data API designed to solve the issues related to data ingest and preparation. In this talk you'll see how Kite can be used for everything from simple tasks to production-ready data pipelines in minutes.
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Webinar: Selecting the Right SQL-on-Hadoop Solution - MapR Technologies
In the crowded SQL-on-Hadoop market, choosing the right solution for your business can be difficult. In this webinar, learn firsthand from Rick van der Lans, independent analyst and managing director of R20/Consultancy, how to sort through this market complexity and what tough questions to ask when evaluating prospective SQL-on-Hadoop solutions.
Differentiate Big Data vs Data Warehouse use cases for a cloud solution - James Serra
It can be quite challenging to keep up with the frequent updates to the Microsoft products and to understand all their use cases and how the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS, as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos - Lester Martin
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Microsoft Data Platform - What's included - James Serra
The pace of Microsoft product innovation is so fast that, even though I spend half my days learning, I struggle to keep up. And as I work with customers I find they are often in the dark about many of the products that we have, since they are focused on just keeping what they have running and putting out fires. So, let me cover what products you might have missed in the Microsoft data platform world. Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it. My goal is to help you not only understand each product but understand how they all fit together and their proper use cases, allowing you to build the appropriate solution that can incorporate any data in the future no matter the size, frequency, or type. Along the way we will touch on technologies covering NoSQL, Hadoop, and open source.
Azure Cafe Marketplace with Hortonworks March 31 2016 - Joan Novino
Azure Big Data: “Got Data? Go Modern and Monetize”.
In this session you will learn how Hortonworks Data Platform (HDP), architected, developed, and built completely in the open, provides an enterprise-ready data platform for adopting a Modern Data Architecture.
Are you confused by Big Data? Get in touch with this new "black gold" and familiarize yourself with undiscovered insights through our complimentary introductory lesson on Big Data and Hadoop!
Architecting a next-generation data platform - hadooparchbook
Slides for Architecting a next-generation data platform at Strata + Hadoop World, London 2017.
https://conferences.oreilly.com/strata/strata-eu/public/schedule/detail/57652
Architecting next generation big data platform - hadooparchbook
A tutorial on architecting next generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
Hadoop application architectures - using Customer 360 as an example - hadooparchbook
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on countries – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides from my and Rik Marselis' talk at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We also held a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
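To give a feel for what a power-flow computation does, here is a toy DC power flow on a three-bus network written in plain Python. This is deliberately not PowSyBl code (PowSyBl's own load-flow engines, reached via its Java API or the pypowsybl binding, are far more complete); in the DC approximation, net active-power injections relate linearly to bus voltage angles via the susceptance matrix, P = B'θ.

```python
# Toy DC power flow on a 3-bus network (illustrative only; PowSyBl's
# load-flow engines handle full AC models, contingencies, etc.).

lines = {(1, 2): 0.1, (1, 3): 0.1, (2, 3): 0.1}  # branch reactances (p.u.)
injections = {2: -0.5, 3: -0.5}                  # loads at buses 2 and 3
# Bus 1 is the slack bus (theta = 0); it balances the system.

# Reduced susceptance matrix for buses 2 and 3:
# B22 = 1/0.1 + 1/0.1 = 20,  B23 = B32 = -1/0.1 = -10,  B33 = 20
b22, b23, b33 = 20.0, -10.0, 20.0
p2, p3 = injections[2], injections[3]

# Solve the 2x2 system [b22 b23; b23 b33] [t2; t3] = [p2; p3] directly.
det = b22 * b33 - b23 * b23
theta = {
    1: 0.0,
    2: (p2 * b33 - b23 * p3) / det,
    3: (b22 * p3 - p2 * b23) / det,
}

# Branch flow in the DC approximation: f_ij = (theta_i - theta_j) / x_ij
flows = {(i, j): (theta[i] - theta[j]) / x for (i, j), x in lines.items()}
print({k: round(v, 3) for k, v in flows.items()})
# The slack bus supplies 1.0 p.u. split evenly over its two lines, and
# by symmetry the 2-3 line carries no power.
```

Real tools solve the same kind of system at the scale of a national grid, with AC physics, tap changers, and remedial actions layered on top.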
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud, and open source: exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf - 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Generating a custom Ruby SDK for your web service or Rails API using Smithy - g2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Data warehousing with Hadoop
1. REMINDER
Check in on the COLLABORATE mobile app
Architectural Considerations for Data Warehousing with Hadoop
Prepared by:
Mark Grover, Software Engineer
Jonathan Seidman, Solutions Architect
Cloudera, Inc.
github.com/hadooparchitecturebook/hadoop-arch-book/tree/master/ch11-data-warehousing
Session ID#: 10251
@mark_grover
@jseidman
2. About Us
■ Mark
▪ Software Engineer at Cloudera
▪ Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
▪ Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
■ Jonathan
▪ Senior Solutions Architect/Partner Engineering at Cloudera
▪ Previously, Technical Lead on the big data team at Orbitz Worldwide
▪ Co-founder of the Chicago Hadoop User Group and Chicago Big Data
3. About the Book
■ @hadooparchbook
■ hadooparchitecturebook.com
■ github.com/hadooparchitecturebook
■ slideshare.com/hadooparchbook
4. Agenda
■ Typical data warehouse architecture.
■ Challenges with the existing data warehouse architecture.
■ How Hadoop complements an existing data warehouse architecture.
■ (Very) Brief intro to Hadoop.
■ Example use case.
■ Walkthrough of example use case implementation.
6. Example High Level Data Warehouse Architecture
(Diagram: Operational Source Systems → Extract Data → Staging Area (Transformations) → Load Data → Data Warehouse → Data Analysis/Visualization Tools)
8. Challenge – ETL/ELT Processing
(Diagram: OLTP systems and Enterprise Applications → Extract → Transform → Load → ODS → Data Warehouse; Business Intelligence tools Query the warehouse, with further Transform steps in between)
9. Challenges – ETL/ELT Processing
(Same diagram, with the pain points marked:)
1 Slow Data Transformations = Missed ETL SLAs.
2 Slow Queries = Frustrated Business Users.
10. Challenges – Data Archiving
(Diagram: Data Warehouse → Tape Archive)
■ Full-fidelity data only kept for a short duration
■ Expensive or sometimes impossible to look at historical raw data
11. Challenge – Disparate Data Sources
(Diagram: disparate sources → ??? → Data Warehouse → Business Intelligence)
■ How do you join data from disparate sources with EDW?
12. Challenge – Lack of Agility
■ Responding to changing requirements, mistakes, etc. requires lengthy processes.
13. Challenge – Exploratory Analysis in the EDW
■ Difficult for users to do exploratory analysis of data in the data warehouse.
(Diagram: Business Users, Developers, and Analysts all contending for the Data Warehouse)
19. Agile Data Access with Hadoop
Schema-on-Write (RDBMS):
• Prescriptive Data Modeling:
  • Create static DB schema
  • Transform data into RDBMS
  • Query data in RDBMS format
• New columns must be added explicitly before new data can propagate into the system.
• Good for Known Unknowns (Repetition)
Schema-on-Read (Hadoop):
• Descriptive Data Modeling:
  • Copy data in its native format
  • Create schema + parser
  • Query data in its native format
• New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
• Good for Unknown Unknowns (Exploration)
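The schema-on-read idea can be sketched in a few lines of Python: the raw pipe-delimited records land untouched, and a schema (field names plus a parser) is applied only when the data is read. The field names follow the u.user layout shown later in the deck; everything else is illustrative.

```python
# Schema-on-read sketch: raw data is stored as-is; a schema is
# applied only when the data is read, not when it is written.
raw_lines = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
]

# The "schema + parser", defined at read time.
fields = ["user_id", "age", "gender", "occupation", "zip_code"]

def read_with_schema(lines):
    for line in lines:
        values = line.split("|")
        record = dict(zip(fields, values))
        record["age"] = int(record["age"])  # type applied on read
        yield record

users = list(read_with_schema(raw_lines))
print(users[0]["occupation"])  # technician
```

If the schema or parser changes later, all previously landed raw data "retroactively" picks up the new interpretation on the next read, which is the agility the slide describes.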
22. What is Apache Hadoop?
Has the Flexibility to Store
and Mine Any Type of Data
Ask questions across structured and
unstructured data that were previously
impossible to ask or solve
Not bound by a single schema
Excels at
Processing Complex Data
Scale-out architecture divides
workloads across multiple nodes
Flexible file system eliminates ETL
bottlenecks
Scales
Economically
Can be deployed on commodity
hardware
Open source platform guards
against vendor lock
Hadoop
Distributed File
System (HDFS)
Self-Healing, High
Bandwidth Clustered
Storage
Parallel
Processing
(MapReduce,
Spark, Impala,
etc.)
Distributed Computing
Frameworks
Apache Hadoop is an open source
platform for data storage and
processing that is…
Scalable
Fault tolerant
Distributed
CORE HADOOP SYSTEM COMPONENTS
23. Oracle Big Data Appliance
■ All of the capabilities we’re talking about here are available as part of the Oracle BDA.
26. Other Challenges – Architectural Considerations
▪ Data Sources
▪ Ingestion
▪ Raw Data Storage (Formats, Schema)
▪ Processed Data Storage (Formats, Schema)
▪ Processing
▪ Data Consumption
▪ Orchestration (Scheduling, Managing, Monitoring)
▪ Metadata Management
27. Hadoop Third Party Ecosystem
▪ Data Systems
▪ Applications
▪ Infrastructure
▪ Operational Tools
29. Use-case
■ Movielens dataset
■ Users register by entering some demographic information
▪ Users can update demographic information later on
■ Rate movies
▪ Ratings can be updated later on
■ Auxiliary information about movies is available
▪ e.g. release date, IMDB URL, etc.
30. Movielens data set
u.user
user id | age | gender | occupation | zip code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
31. Movielens data set
u.item
movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
32. Movielens data set
u.data
user id | item id | rating | timestamp
196|242|3|881250949
186|302|3|891717742
22|377|1|878887116
244|51|2|880606923
166|346|1|886397596
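As a toy version of the processing step that comes later in the deck, the rating records above can be parsed and aggregated directly. A hypothetical Python sketch (stdlib only, using the sample rows from this slide):

```python
from collections import defaultdict

# Sample rows from u.data: user id | item id | rating | timestamp
rows = [
    "196|242|3|881250949",
    "186|302|3|891717742",
    "22|377|1|878887116",
    "244|51|2|880606923",
    "166|346|1|886397596",
]

# Average rating per movie (item id).
totals = defaultdict(lambda: [0, 0])  # item_id -> [rating sum, count]
for row in rows:
    user_id, item_id, rating, timestamp = row.split("|")
    totals[item_id][0] += int(rating)
    totals[item_id][1] += 1

avg_rating = {item: s / n for item, (s, n) in totals.items()}
print(avg_rating["242"])  # 3.0
```

At scale the same aggregation would be expressed in Hive or Spark over the full data set, but the record structure is identical.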
36. Data Modeling Considerations
■ We need to consider the following in our architecture:
▪ Storage layer – HDFS? HBase? Etc.
▪ File system schemas – how will we lay out the data?
▪ File formats – what storage formats to use for our data, both raw and processed data?
▪ Data compression formats?
■ Hadoop is not a database, so these considerations will be different from an RDBMS.
38. Why Denormalize?
■ Regular joins are expensive in Hadoop
■ When you have 2 data sets, there are no guarantees that corresponding records will be present on the same node
■ Such a guarantee exists when storing such data in a single data set
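A minimal Python sketch of the trade-off (hypothetical records, not the deck's actual pipeline): join the normalized sets once during processing, then persist the wide, self-contained records so later queries need no join at all.

```python
# Two normalized data sets: in Hadoop these may live on different
# nodes, so joining them at query time is expensive.
users = {1: {"age": 24, "occupation": "technician"}}
ratings = [{"user_id": 1, "item_id": 242, "rating": 3}]

# The query-time join we want to avoid repeating for every query:
joined = [{**r, **users[r["user_id"]]} for r in ratings]

# Denormalized: join once during processing and store a single wide
# record set (e.g. the user_rating Parquet data set from slide 46);
# every record is self-contained, so reads need no join.
user_rating = joined
print(user_rating[0]["occupation"])  # technician
```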
44. Hadoop File Types
■ Formats designed specifically to store and process data on
Hadoop:
▪ File based – SequenceFile
▪ Serialization formats – Thrift, Protocol Buffers, Avro
▪ Columnar formats – RCFile, ORC, Parquet
46. Our Storage Format Recommendation
■ Columnar format (Parquet) for merged/compacted data sets
▪ user, user_rating, movie
■ Row format (Avro) for history/append-only data sets
▪ user_history, user_rating_fact
49. Ingestion – Apache Kafka
(Diagram: multiple Source Systems publish to Kafka; Hadoop, Security Systems, Real-time monitoring, and the Data Warehouse consume from it)
50. Ingestion – Apache Sqoop
■ Apache project designed to ease import and export of data between Hadoop and external data stores such as an RDBMS.
■ Provides functionality to do bulk imports and exports of data.
■ Leverages MapReduce to transfer data in parallel.
(Diagram: Client runs import → Sqoop collects metadata, generates code, and executes an MR job → map tasks pull data in parallel and write to Hadoop)
51. Sqoop Import Example – Movie
sqoop import --connect
jdbc:mysql://mysql_server:3306/movielens
--username myuser --password mypass --query
'SELECT movie.*, group_concat(genre.name)
FROM movie
JOIN movie_genre ON (movie.id =
movie_genre.movie_id)
JOIN genre ON (movie_genre.genre_id = genre.id)
WHERE ${CONDITIONS}
GROUP BY movie.id'
--split-by movie.id --as-avrodatafile
--target-dir /data/movielens/movie
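The --split-by column is what lets Sqoop parallelize the import: it finds the minimum and maximum of that column and hands each mapper a contiguous range to pull. A rough Python sketch of that range splitting (illustrative only; Sqoop's real splitter also handles data types, nulls, and skew):

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into contiguous per-mapper ranges,
    roughly how Sqoop turns --split-by into per-mapper WHERE clauses."""
    size = (max_id - min_id + 1) / num_mappers
    ranges = []
    for i in range(num_mappers):
        lo = min_id + round(i * size)
        hi = min_id + round((i + 1) * size) - 1
        ranges.append((lo, min(hi, max_id)))
    return ranges

# e.g. movie.id spanning 1..1682 across 4 mappers
for lo, hi in split_ranges(1, 1682, 4):
    print(f"WHERE movie.id >= {lo} AND movie.id <= {hi}")
```

Each generated range becomes one map task's query, which is why a roughly uniformly distributed split column gives the best parallelism.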
55. Merge Updates
hive> INSERT OVERWRITE TABLE user_tmp
      SELECT user.*
      FROM user
      LEFT OUTER JOIN user_upserts
        ON (user.id = user_upserts.id)
      WHERE user_upserts.id IS NULL
      UNION ALL
      SELECT id, age, occupation, zipcode,
             TIMESTAMP(last_modified)
      FROM user_upserts;
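The Hive statement above is a classic upsert merge: keep every base row that has no matching upsert, then append all upsert rows. The same logic in a small Python sketch (hypothetical records; in the deck this runs over the user and user_upserts tables):

```python
# Base table and incoming upserts, keyed by user id.
user = [
    {"id": 1, "age": 24, "occupation": "technician"},
    {"id": 2, "age": 53, "occupation": "other"},
]
user_upserts = [
    {"id": 2, "age": 54, "occupation": "writer"},  # update to id 2
    {"id": 3, "age": 23, "occupation": "writer"},  # brand new row
]

# Base rows with no matching upsert
# (the LEFT OUTER JOIN ... WHERE user_upserts.id IS NULL branch) ...
upsert_ids = {u["id"] for u in user_upserts}
unchanged = [u for u in user if u["id"] not in upsert_ids]

# ... UNION ALL the upserted rows, giving the merged table.
user_tmp = unchanged + user_upserts
print(sorted(u["id"] for u in user_tmp))  # [1, 2, 3]
```

Because HDFS files are immutable, the merge rewrites the whole table (INSERT OVERWRITE) rather than updating rows in place.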
63. Please complete the session evaluation
Thank you!
@hadooparchbook
You may complete the session evaluation either on paper or online via the mobile app
69. What is Hadoop?
Hadoop is an open-source system designed to store and process petabyte-scale data.
That’s pretty much what you need to know. Well, almost…
71. Our Compression Codec Recommendation
■ Snappy for all data sets (columnar as well as row based)
72. File Format Choices
Data set Storage format Compression Codec
movie Parquet Snappy
user_history Avro Snappy
user Parquet Snappy
user_rating_fact Avro Snappy
user_rating Parquet Snappy