Siligong.Data - May 2021 - Transforming your analytics workflow with dbt



Learn about dbt and its core concepts and where it can sit in your analytics engineering workflow - Jon Su @ Internetrix


Transforming your analytics workflow with dbt
#Siligong.Data May 2021 Meetup
Jon Su
Jon Su @ Internetrix

Who knows what SQL is?
(pronounced sequel or ss-cue-l depending on who you ask)
https://www.commitstrip.com/en/?

Tonight’s Talk: dbt (data build tool)
● A History Lesson
● What is dbt?
● Demo: dbt in action

ETL was all the rage!
https://blog.bismart.com/en/what-do-we-do-etl

The focus for data teams was on extraction
● How to build a data infrastructure that can scale
● Controlling storage costs
● Performance Tuning

But then came the “Cloud”...
This is what the Cloud actually looks like in real life ;)

Industry Trend #1: Move away from “do-it-all-in-1-tool”
https://lakefs.io/the-state-of-data-engineering-in-2021/

Industry Trend #2: Shift from ETL to ELT
https://www.striim.com/etl-vs-elt/

But there are still problems...
https://www.striim.com/etl-vs-elt/
Data Science
BI Tools

Analytics workflow problems...
● Data consumers don’t have the data when they need it
○ Silos between different members of a traditional data team
[Diagram: "Triangle of Madness" - Data Analyst, Data Engineer, Business]

Analytics workflow problems...
● Beautiful dashboards that suddenly break when something upstream goes wrong or the source schema changes
● Having to rewrite the same piece of SQL again and again...
○ Not sharing analytics code in a team
○ Analysts work in isolation, knowledge isn’t shared
○ Different definitions of a shared metric
● Hard for a business to adopt BI easily
○ time + $
○ Low BI adoption

Does this have to be the way?

dbt (data build tool) lets anyone who knows SQL author their analytics workflow and make their own data pipelines.
If you know SQL, you can use it - essentially no barrier to entry
Supports a large number of warehouses through adapters
● Can build your own adapter

Introduces basic software engineering principles to solve the workflow problems we mentioned!

● Version Control
● Quality Assurance
● Modularity
● Multiple Environments
● Documentation
● Automated Tools
● Code Maintainability
https://docs.getdbt.com/docs/about/viewpoint

How does dbt work?
https://www.getdbt.com/product/

How does dbt work?
A dbt project consists of .sql and .yml files:
1. Write dbt code (SQL + Jinja templating)
2. Run a dbt command from the CLI or dbt Cloud
3. dbt compiles your dbt code into raw SQL and executes that code against your warehouse.
4. Data is transformed and then created as tables/views back in the data warehouse
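
The compile-then-execute steps above can be sketched in a few lines of Python. This is a toy stand-in and not how dbt is implemented: the model names, the raw_orders table, the regex-based compile_model helper, and the use of SQLite in place of a warehouse are all assumptions made for illustration. Only the {{ ref('...') }} syntax mirrors what you would write in a dbt model.

```python
import re
import sqlite3

# Hypothetical models: name -> dbt-style SQL. The {{ ref('...') }} syntax mirrors
# dbt's, but the "compiler" below is a toy stand-in, not dbt itself.
MODELS = {
    "stg_orders": "select id, amount from raw_orders where amount > 0",
    "order_totals": (
        "select count(*) as n, sum(amount) as total "
        "from {{ ref('stg_orders') }}"
    ),
}

REF_PATTERN = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")


def compile_model(sql: str) -> str:
    """Step 3: turn templated SQL into raw SQL by resolving each ref() call."""
    return REF_PATTERN.sub(lambda match: match.group(1), sql)


def run(conn: sqlite3.Connection, models: dict) -> None:
    """Steps 3-4: compile every model and create it as a view."""
    # Assumes the dict is already in dependency order.
    for name, sql in models.items():
        conn.execute(f"create view {name} as {compile_model(sql)}")


conn = sqlite3.connect(":memory:")  # SQLite stands in for the data warehouse
conn.execute("create table raw_orders (id integer, amount real)")
conn.executemany(
    "insert into raw_orders values (?, ?)",
    [(1, 10.0), (2, -5.0), (3, 20.0)],
)
run(conn, MODELS)

result = conn.execute("select n, total from order_totals").fetchone()
print(result)  # (2, 30.0)
```

The sketch relies on the models being listed in dependency order; a real project would not need that assumption, since ref() is what lets the tool work out the order for you.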

Demo
Scenario:
● Google Merchandise Store
Dataset:
● Google Analytics sample dataset for BigQuery
Goal:
● Find all Purchases made in Feb 2017 by Users who previously visited the site using a Chrome browser in Jan 2017
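
One way to express the demo goal is as two intermediate steps (users who visited with Chrome in January, purchases made in February) joined together. The sketch below is not the demo's source code: it assumes a hypothetical flattened sessions table with made-up column names and sample rows, and runs on SQLite rather than BigQuery, where the real Google Analytics export is nested.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened table; the real dataset's schema is nested and differs.
conn.execute(
    "create table sessions "
    "(user_id text, visit_date text, browser text, transactions integer)"
)
conn.executemany(
    "insert into sessions values (?, ?, ?, ?)",
    [
        ("u1", "2017-01-10", "Chrome", 0),   # Chrome visit in Jan
        ("u1", "2017-02-03", "Safari", 1),   # Feb purchase: should be returned
        ("u2", "2017-01-15", "Firefox", 0),  # no Chrome visit in Jan
        ("u2", "2017-02-20", "Chrome", 2),   # Feb purchase, user not qualified
        ("u3", "2017-01-05", "Chrome", 0),   # Chrome visit in Jan
        ("u3", "2017-03-01", "Chrome", 1),   # purchase falls outside Feb
    ],
)

QUERY = """
with chrome_jan_users as (
    select distinct user_id
    from sessions
    where browser = 'Chrome'
      and visit_date >= '2017-01-01' and visit_date < '2017-02-01'
),
feb_purchases as (
    select user_id, visit_date, transactions
    from sessions
    where transactions > 0
      and visit_date >= '2017-02-01' and visit_date < '2017-03-01'
)
select p.user_id, p.visit_date, p.transactions
from feb_purchases as p
join chrome_jan_users as c on c.user_id = p.user_id
order by p.user_id, p.visit_date
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('u1', '2017-02-03', 1)]
```

In a dbt project, each of the two common table expressions could plausibly live in its own model and be pulled into the final query with ref(), which is where the modularity mentioned earlier comes in.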

● dbt website: https://getdbt.com
● Demo Source Code: https://github.com/jkersu/dbt-basic-demo
● Get in touch by email at: jon@irx.io
The End.


Data Con LA 2022 - Open Source Large Knowledge Graph Factory

Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch Application to the Next Level...

Running Data Platforms Like Products

Challenges of applying Blockchain to enterprise systems in NTTDATA

Workflow Engines + Luigi

Mihai Tataran - Building Windows 8 Applications with HTML5 and JS

GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery

A Connected Data Landscape: Virtualization and the Internet of Things

Maximize Big Data ROI via Best of Breed Patterns and Practices

MEF Üniversitesi - IoT & Data Dersi

How a Time Series Database Contributes to a Decentralized Cloud Object Storag...

Big-Data Server Farm Architecture

Internet of Things in Tbilisi

High-performance database technology for rock-solid IoT solutions

Recently uploaded

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...

saastr

LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...

DanBrown980551

This LF Energy webinar took place June 20, 2024. It featured: -Alex Thornton, LF Energy -Hallie Cramer, Google -Daniel Roesler, UtilityAPI -Henry Richardson, WattTime In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms. This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups. Three primary specifications will be discussed: -Discovery and client registration, emphasizing transparent processes and secure and private access -Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure -Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data

What is an RPA CoE? Session 1 – CoE Vision

DianaGray10

Skybuffer SAM4U tool for SAP license adoption

Tatiana Kojar

Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool. SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.

“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...

Edge AI and Vision Alliance

For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/ Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit. The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers. Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.

High performance Serverless Java on AWS- GoTo Amsterdam 2024

Vadym Kazulkin

Java is for many years one of the most popular programming languages, but it used to have hard times in the Serverless community. Java is known for its high cold start times and high memory footprint, comparing to other programming languages like Node.js and Python. In this talk I'll look at the general best practices and techniques we can use to decrease memory consumption, cold start times for Java Serverless development on AWS including GraalVM (Native Image) and AWS own offering SnapStart based on Firecracker microVM snapshot and restore and CRaC (Coordinated Restore at Checkpoint) runtime hooks. I'll also provide a lot of benchmarking on Lambda functions trying out various deployment package sizes, Lambda memory settings, Java compilation options and HTTP (a)synchronous clients and measure their impact on cold and warm start times.

"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk

Fwdays

At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience

"Scaling RAG Applications to serve millions of users", Kevin Goedecke

Fwdays

A Deep Dive into ScyllaDB's Architecture

ScyllaDB

GNSS spoofing via SDR (Criptored Talks 2024)

Javier Junquera

In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security. This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing. The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.

"$10 thousand per minute of downtime: architecture, queues, streaming and fin...

Fwdays

Direct losses from downtime in 1 minute = $5-$10 thousand dollars. Reputation is priceless. As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency. We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.

inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill

LizaNolte

HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable. In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed: Key Takeaways: Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement. Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers. Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.

AppSec PNW: Android and iOS Application Security with MobSF

Ajin Abraham

Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application. This talk covers: Using MobSF for static analysis of mobile applications. Interactive dynamic security assessment of Android and iOS applications. Solving Mobile app CTF challenges. Reverse engineering and runtime analysis of Mobile malware. How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.

Columbus Data & Analytics Wednesdays - June 2024

Jason Packer

[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...

Jason Yip

The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.

Y-Combinator seed pitch deck template PP

c5vrf27qcz

Apps Break Data

Ivo Velitchkov

How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?

zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...

Alex Pruden

Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty, is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security. Paper Link: https://eprint.iacr.org/2024/257

"Choosing proper type of scaling", Olena Syrota

Fwdays

Fueling AI with Great Data with Airbyte Webinar

Zilliz

Recently uploaded (20)

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...

LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...

What is an RPA CoE? Session 1 – CoE Vision

Skybuffer SAM4U tool for SAP license adoption

“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...

High performance Serverless Java on AWS- GoTo Amsterdam 2024

"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk

"Scaling RAG Applications to serve millions of users", Kevin Goedecke

A Deep Dive into ScyllaDB's Architecture

GNSS spoofing via SDR (Criptored Talks 2024)

"$10 thousand per minute of downtime: architecture, queues, streaming and fin...

inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill

AppSec PNW: Android and iOS Application Security with MobSF

Columbus Data & Analytics Wednesdays - June 2024

[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...

Y-Combinator seed pitch deck template PP

Apps Break Data

zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...

"Choosing proper type of scaling", Olena Syrota

Fueling AI with Great Data with Airbyte Webinar

Siligong.Data - May 2021 - Transforming your analytics workflow with dbt

1. Transforming your analytics workflow with dbt #Siligong.Data May 2021 Meetup Jon Su Jon Su @ Internetrix

2. Who knows what SQL is? (pronounced "sequel" or "ess-cue-el" depending on who you ask) https://www.commitstrip.com/en/?

3. Tonight’s Talk: dbt (data build tool) ● A History Lesson ● What is dbt? ● Demo: dbt in action

5. ETL was all the craze! https://blog.bismart.com/en/what-do-we-do-etl

6. The focus for data teams was on extraction: ● How to build a data infrastructure that can scale ● Controlling storage costs ● Performance tuning

7. But then came the “Cloud”... This is what the Cloud actually looks like in real life ;)


9. Industry Trend #1: Move away from “do-it-all-in-1-tool” https://lakefs.io/the-state-of-data-engineering-in-2021/

10. Industry Trend #2: Shift from ETL to ELT https://www.striim.com/etl-vs-elt/

11. But there are still problems... https://www.striim.com/etl-vs-elt/ (diagram labels: Data Science, BI Tools)

12. Analytics workflow problems... ● Data consumers don’t have the data when they need it ○ Silos between different members of a traditional data team (diagram: the “Triangle of Madness” between Data Analyst, Data Engineer and Business)

13. Analytics workflow problems... ● Beautiful dashboards that suddenly break when something upstream goes wrong or the source schema changes ● Having to rewrite the same piece of SQL again and again ○ Not sharing analytics code in a team ○ Analysts work in isolation, knowledge isn’t shared ○ Different definitions of a shared metric ● Hard for a business to adopt BI easily ○ Time + $ ○ Low BI adoption

14. Does this have to be the way?


16. dbt (data build tool) lets anyone who knows SQL author their analytics workflow and build their own data pipelines. ● If you know SQL, you can use it: essentially no barrier to entry ● Supports a large number of warehouses through adapters ○ You can build your own adapter

17. Introduces basic software engineering principles to solve the workflow problems we mentioned!

18. Version Control, Quality Assurance, Modularity, Multiple Environments, Documentation, Automated Tools, Code Maintainability https://docs.getdbt.com/docs/about/viewpoint

19. How does dbt work? https://www.getdbt.com/product/

20. How does dbt work? A dbt project consists of .sql and .yml files: 1. Write dbt code (SQL + Jinja templating). 2. Run a dbt command from the CLI or dbt Cloud. 3. dbt compiles your dbt code into raw SQL and executes that SQL against your warehouse. 4. The transformed data is created as tables/views back in the data warehouse.
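The four steps on slide 20 can be sketched as a toy Python script. This is only an illustration of the compile-then-execute idea, not how dbt itself is implemented: the model names (stg_orders, order_totals), the raw_orders table, and the compile_model and run helpers are all hypothetical, an in-memory SQLite database stands in for the warehouse, and a regular expression stands in for real Jinja rendering.

```python
import re
import sqlite3

# Step 1: hypothetical "model files" (name -> SQL with a Jinja-style ref call).
MODELS = {
    "stg_orders": "select id, amount from raw_orders where amount > 0",
    "order_totals": "select count(*) as n, sum(amount) as total "
                    "from {{ ref('stg_orders') }}",
}

# Matches {{ ref('model_name') }} with optional whitespace inside the braces.
REF_PATTERN = re.compile(r"\{\{\s*ref\('(\w+)'\)\s*\}\}")


def compile_model(sql: str) -> str:
    """Step 3a: turn templated SQL into raw SQL by resolving each ref()."""
    return REF_PATTERN.sub(lambda match: match.group(1), sql)


def run(conn: sqlite3.Connection, models: dict) -> None:
    """Steps 3b and 4: execute each compiled model and create it as a view.

    Models are built in the order given; this sketch does not work out the
    dependency order from the ref() calls.
    """
    for name, sql in models.items():
        conn.execute(f"create view {name} as {compile_model(sql)}")


# A stand-in "warehouse" with one raw source table.
conn = sqlite3.connect(":memory:")
conn.execute("create table raw_orders (id integer, amount real)")
conn.executemany(
    "insert into raw_orders values (?, ?)",
    [(1, 10.0), (2, 0.0), (3, 5.5)],
)

# Step 2: the equivalent of invoking the tool.
run(conn, MODELS)
result = conn.execute("select n, total from order_totals").fetchone()
print(result)
```

The point of the sketch is that a model only ever names its upstream model through ref(), so the same SQL can be compiled against different schemas or environments without editing the query text.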

21. dbt Core and dbt Cloud

22. Demo scenario: ● Google Merchandise Store dataset ○ Google Analytics sample dataset for BigQuery. Goal: ● Find all purchases made in Feb 2017 by users who previously visited the site using a Chrome browser in Jan 2017
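The demo goal can be sketched as a single query. The example below is a minimal illustration, assuming a flattened, hypothetical sessions table (user_id, visit_date, browser, purchases) with made-up rows in SQLite; the real Google Analytics sample dataset in BigQuery is nested and uses different field names, and the demo repository linked on the final slide may structure the logic differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical flattened sessions table with made-up rows.
conn.execute(
    "create table sessions "
    "(user_id text, visit_date text, browser text, purchases integer)"
)
conn.executemany(
    "insert into sessions values (?, ?, ?, ?)",
    [
        ("u1", "2017-01-10", "Chrome", 0),   # Chrome visit in January
        ("u1", "2017-02-03", "Safari", 1),   # purchase in February
        ("u2", "2017-01-15", "Firefox", 0),  # January visit, not Chrome
        ("u2", "2017-02-07", "Chrome", 1),
        ("u3", "2017-01-20", "Chrome", 0),
        ("u3", "2017-02-11", "Chrome", 0),   # February visit, no purchase
        ("u4", "2017-02-02", "Chrome", 2),   # no January visit at all
    ],
)

QUERY = """
with chrome_jan_users as (
    -- users with at least one Chrome session in January 2017
    select distinct user_id
    from sessions
    where browser = 'Chrome'
      and visit_date >= '2017-01-01' and visit_date < '2017-02-01'
),
feb_purchases as (
    -- sessions in February 2017 that contain a purchase
    select user_id, visit_date, purchases
    from sessions
    where purchases > 0
      and visit_date >= '2017-02-01' and visit_date < '2017-03-01'
)
select p.user_id, p.visit_date, p.purchases
from feb_purchases as p
join chrome_jan_users as c on c.user_id = p.user_id
order by p.visit_date
"""

result_rows = conn.execute(QUERY).fetchall()
print(result_rows)
```

In a dbt project, each of the two common table expressions would be a natural candidate for its own model, with the final select referencing them, so that the intermediate steps can be reused and tested separately.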

23. ● dbt website: https://getdbt.com ● Demo source code: https://github.com/jkersu/dbt-basic-demo ● Get in touch by email at: jon@irx.io The End.
