In this hands-on webinar we'll explore the data warehousing concept of Slowly Changing Dimensions (SCDs) and common use cases for managing SCDs when dealing with customer data. This webinar will demonstrate different methods for tracking SCDs in a data warehouse, and how Treasure Data Workflow can be used to create robust data pipelines to handle these processes.
2. Agenda
● Introduction
● Treasure Data Workflow
● Overview of Slowly Changing Dimensions
● Window Functions
● Handling Type 2 SCDs using Treasure Data
3. Introduction
• Scott Mitchell
• Senior Solution Engineer
• Works with enterprise clients to maximize the activation of their data
• smitchell@treasure-data.com
4. Introduction
Treasure Data is a Customer Data Platform
“Customer Data Platform (CDP) is a marketer-based management system that creates a persistent, unified customer database that is accessible to other systems. Data is pulled from multiple sources, cleaned, and combined to create a single customer view. This structured data is then made available to other marketing systems. CDP provides real-time segmentation for sophisticated personalized marketing.”
https://en.wikipedia.org/wiki/Customer_Data_Platform
5. Our Customer Data Platform: Foundation
Data Management
● 1st party data (Your data): Web, Mobile, Apps, CRMs, Offline
● 2nd & 3rd party DMPs (enrichment)
Tool Integration
● Campaigns, Advertising, Social media, Reporting, BI & data science
ID Unification
Persistent Storage
Workflow Orchestration
All Your Data → Activation
Segmentation (Profiles, Segments)
Measurement
7. DATA ORCHESTRATION AND WORKFLOW MANAGEMENT
• Workflow management across data input, processing and output
• Supports both scheduled & trigger-based execution
• Cloud-based and Client-hosted. The Client-hosted version can run custom code.
• Cloud-based version has both web UI & REST API
The core engine is built on our open source project, Digdag.
8. Overview
Treasure Workflow allows users to build repeatable data processing pipelines that consist of Treasure Data jobs.
9. Why use Treasure Workflow? (Benefits)
1. Enhanced Organization
• Organize your processing workflows into groups of similarly-purposed tasks
2. Reduce Errors
• Dependencies no longer have to be managed by scheduled time alone
3. Ease Error Handling
• Split large scripts & queries into smaller, more manageable jobs
4. Improve Collaboration
• Organize your job flows into projects
10. WORKFLOW DEFINITION: CLOSER LOOK
timezone: Asia/Tokyo
schedule:
  daily>: 07:00:00
_export:
  td:
    database: nishi
+load:
  td_load>: import/s3_load.yml
  database: nishi
  table: monthly_goods_sales
+daily:
  td>: queries/daily_open.sql
  create_table: daily_open
+monthly:
  td>: queries/monthly_open.sql
  result_connection: nishi_s3
  result_settings:
    bucket: nishitetsu-test
    path: /monthly_open.csv
• File extension should be ".dig" to be recognized as a workflow
• Standard YAML
• Task names are prefixed with "+"
• Operators are suffixed with ">"
• Schedules can be set with "schedule:"
• Variables are supported via ${variable_name}
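Tying this back to the SCD use case, here is a minimal workflow sketch that stages incremental CRM data and then rebuilds an SCD table. The names (scd_demo, import/crm_load.yml, queries/rebuild_scd.sql, staging_company, target_company) are hypothetical placeholders, not from the deck:
timezone: UTC
schedule:
  daily>: 06:00:00
_export:
  td:
    database: scd_demo
+stage:
  td_load>: import/crm_load.yml
  database: scd_demo
  table: staging_company
+rebuild:
  td>: queries/rebuild_scd.sql
  create_table: target_company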
11. REPRESENTATIVE OPERATORS
Control Flow
  call>:             Call another workflow
  loop>:             Repeat tasks a specified # of times
  for_each>:         Loop through a specified list
  if>:               if/else control flow
Treasure Data
  td>:               Run a specified TD query
  td_run>:           Run a saved query
  td_ddl>:           Create, delete, rename, truncate tables
  td_load>:          Invoke an input data transfer
  td_for_each>:      Loop through a query result row by row
AWS
  s3_wait>:          Wait for new files in S3 & download
  redshift>:         Run Redshift query
  redshift_load>:    Load data into Redshift
  redshift_unload>:  Unload data from Redshift
Google Cloud Platform
  bq>:               Run BigQuery query
  bq_extract>:       Unload data from BigQuery to GCS
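As a quick illustration of the control-flow operators above, a sketch that runs one query per region with for_each> (the query files and table names are hypothetical):
+per_region:
  for_each>:
    region: [us, eu, apac]
  _do:
    +load:
      td>: queries/load_${region}.sql
      create_table: staging_${region}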
13. Slowly Changing Dimensions
• Dimensions within a dataset that are prone to change unpredictably
• Example: the phone number or email field of a CRM dataset
• Data available from a CRM usually represents the current, up-to-date value of each field for each customer
• Storing a history of this customer data requires managing these slowly changing dimensions (SCDs)
15. Type 1: Overwrite the field
Old Record:
company_id | company_name    | company_state
123        | Sterling Cooper | New York
16. Type 1: Overwrite the field
Old Record:
company_id | company_name    | company_state
123        | Sterling Cooper | New York
New Record:
company_id | company_name    | company_state
123        | Sterling Cooper | California
17. Type 1: Overwrite the field
Old Record:
company_id | company_name    | company_state
123        | Sterling Cooper | New York
New Record:
company_id | company_name    | company_state
123        | Sterling Cooper | California
SCD Type 1:
company_id | company_name    | company_state
123        | Sterling Cooper | California
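In SQL terms, Type 1 is a plain overwrite with no history kept. A minimal sketch, assuming an engine that supports UPDATE (on append-only stores such as Treasure Data the same effect is achieved by rewriting the table):
-- SCD Type 1: overwrite the changed field in place; history is lost
UPDATE company
SET company_state = 'California'
WHERE company_id = 123;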
18. Type 2: Keep both records, flag the “current” row
Old Record:
company_id | company_name    | company_state
123        | Sterling Cooper | New York
New Record:
company_id | company_name    | company_state
123        | Sterling Cooper | California
SCD Type 2:
company_id | company_name    | company_state | is_current
123        | Sterling Cooper | New York      | 0
123        | Sterling Cooper | California    | 1
19. Type 3: Store the latest two values in one row
Old Record:
company_id | company_name    | company_state
123        | Sterling Cooper | New York
New Record:
company_id | company_name    | company_state
123        | Sterling Cooper | California
SCD Type 3:
company_id | company_name    | company_state_current | company_state_previous
123        | Sterling Cooper | California            | New York
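Type 3 keeps one generation of history in a dedicated column. A minimal sketch under the same UPDATE assumption, using the column names from the table above (standard SQL evaluates the right-hand sides against the pre-update row, so the shift is safe in one statement):
-- SCD Type 3: shift the current value into the "previous" column, then overwrite
UPDATE company
SET company_state_previous = company_state_current,
    company_state_current  = 'California'
WHERE company_id = 123;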
20. Type 4: Use a separate history table
SCD Type 4:
company
company_id | company_name    | company_state
123        | Sterling Cooper | California
company_history
company_id | company_name    | company_state | last_modified_date
123        | Sterling Cooper | New York      | 2007-06-19
123        | Sterling Cooper | California    | 2008-10-12
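Type 4 separates current state from history. A minimal sketch of applying one change (same UPDATE assumption; the literal values mirror the example above):
-- SCD Type 4: append the new version to the history table...
INSERT INTO company_history (company_id, company_name, company_state, last_modified_date)
VALUES (123, 'Sterling Cooper', 'California', DATE '2008-10-12');
-- ...and keep only the latest value in the main table
UPDATE company
SET company_state = 'California'
WHERE company_id = 123;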
22. Type 2: Keep both records, flag the “current” row
Old Record:
company_id | company_name    | company_state
123        | Sterling Cooper | New York
New Record:
company_id | company_name    | company_state
123        | Sterling Cooper | California
SCD Type 2:
company_id | company_name    | company_state | is_current
123        | Sterling Cooper | New York      | 0
123        | Sterling Cooper | California    | 1
23. Type 2: Keep both records, flag the “current” row
Old Record:
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | New York      | 2007-06-19
New Record:
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | California    | 2008-10-12
SCD Type 2:
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 0
123        | Sterling Cooper | California    | 2008-10-12       | 1
24. Type 2: Keep both records, flag the “current” row
Old Record:
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | New York      | 2007-06-19
New Record:
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | California    | 2008-10-12
SCD Type 2:
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 0
123        | Sterling Cooper | California    | 2008-10-12       | 1
25. Window Functions
• Window functions perform calculations across rows of the query result
• They run after the ‘HAVING’ clause but before the ‘ORDER BY’ clause
• They are written in the ‘SELECT’ clause and display results in their own column
• They have three parts:
26. Window Functions
rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate DESC)
function:                rank()
partition specification: PARTITION BY company_id
ordering specification:  ORDER BY lastmodifieddate DESC
27. Window Functions
SELECT
  company_id,
  company_name,
  company_state,
  lastmodifieddate,
  rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate DESC) AS is_current
FROM company
company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | New York      | 2007-06-19
123        | Sterling Cooper | California    | 2008-10-12
Result:
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | California    | 2008-10-12       | 1
123        | Sterling Cooper | New York      | 2007-06-19       | 2
28. Window Functions
SELECT
  company_id,
  company_name,
  company_state,
  lastmodifieddate,
  rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate DESC) AS is_current
FROM company
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | California    | 2008-10-12       | 1
123        | Sterling Cooper | New York      | 2007-06-19       | 2
124        | CGC             | Connecticut   | 2018-05-22       | 1
124        | CGC             | New York      | 2010-08-22       | 2
29. Window Functions
SELECT
  company_id,
  company_name,
  company_state,
  lastmodifieddate,
  rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate ASC) AS is_current
FROM company
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | California    | 2008-10-12       | 2
123        | Sterling Cooper | New York      | 2007-06-19       | 1
124        | CGC             | Connecticut   | 2018-05-22       | 2
124        | CGC             | New York      | 2010-08-22       | 1
30. Window Functions
SELECT
  company_id,
  company_name,
  company_state,
  lastmodifieddate,
  CASE WHEN rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate DESC) = 1
       THEN 1 ELSE 0 END AS is_current
FROM company
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | California    | 2008-10-12       | 1
123        | Sterling Cooper | New York      | 2007-06-19       | 0
124        | CGC             | Connecticut   | 2018-05-22       | 1
124        | CGC             | New York      | 2010-08-22       | 0
31. Window Functions
SELECT
  company_id,
  company_name,
  company_state,
  lastmodifieddate,
  CASE WHEN rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate DESC) = 1
       THEN 1 ELSE 0 END AS is_current
FROM company
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | California    | 2008-10-12       | 1
123        | Sterling Cooper | New York      | 2007-06-19       | 0
124        | CGC             | Connecticut   | 2018-05-22       | 1
124        | CGC             | New York      | 2010-08-22       | 0
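One subtlety worth noting: rank() assigns the same rank to ties, so two rows sharing an identical latest lastmodifieddate would both be flagged as current. If exactly one current row per id is required, row_number() is the usual alternative; a minimal sketch (not from the deck):
SELECT
  company_id,
  company_name,
  company_state,
  lastmodifieddate,
  -- row_number() breaks ties arbitrarily, guaranteeing a single current row per id
  CASE WHEN row_number() OVER (PARTITION BY company_id ORDER BY lastmodifieddate DESC) = 1
       THEN 1 ELSE 0 END AS is_current
FROM company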
32. Implementation in Treasure Data
1. Load incremental data from a data source to a staging table
2. Drop the target table that contains outdated SCD information
3. Window over the staging table, rebuilding the target table with the latest SCD information
33. Implementation in Treasure Data
1. Load incremental data from a data source to a staging table
2. Drop the target table that contains outdated SCD information
3. Window over the staging table, rebuilding the target table with the latest SCD information, as sketched below
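A minimal sketch of the rebuild query behind step 3, assuming the staging_company and target_company tables shown on the following slides; in the workflow, a td> task with create_table would persist this result as the new target_company:
-- Rebuild the full SCD Type 2 target from the staging history
SELECT
  company_id,
  company_name,
  company_state,
  lastmodifieddate,
  CASE WHEN rank() OVER (PARTITION BY company_id ORDER BY lastmodifieddate DESC) = 1
       THEN 1 ELSE 0 END AS is_current
FROM staging_company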
35. Implementation in Treasure Data
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | New York      | 2007-06-19
124        | CGC             | New York      | 2010-08-22
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 1
124        | CGC             | New York      | 2010-08-22       | 1
36. Implementation in Treasure Data
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | New York      | 2007-06-19
124        | CGC             | New York      | 2010-08-22
123        | Sterling Cooper | California    | 2008-10-12
124        | CGC             | Connecticut   | 2018-05-22
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 1
124        | CGC             | New York      | 2010-08-22       | 1
37. Implementation in Treasure Data
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | New York      | 2007-06-19
124        | CGC             | New York      | 2010-08-22
123        | Sterling Cooper | California    | 2008-10-12
124        | CGC             | Connecticut   | 2018-05-22
target_company (dropped)
38. Implementation in Treasure Data
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | New York      | 2007-06-19
124        | CGC             | New York      | 2010-08-22
123        | Sterling Cooper | California    | 2008-10-12
124        | CGC             | Connecticut   | 2018-05-22
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | California    | 2008-10-12       | 1
123        | Sterling Cooper | New York      | 2007-06-19       | 0
124        | CGC             | Connecticut   | 2018-05-22       | 1
124        | CGC             | New York      | 2010-08-22       | 0
40. SCD Type 2 Workflow with Persistent Architecture
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | California    | 2008-10-12
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 1
124        | CGC             | New York      | 2010-08-22       | 1
41. SCD Type 2 Workflow with Persistent Architecture
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | California    | 2008-10-12
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 1
124        | CGC             | New York      | 2010-08-22       | 1
1. Store a temp table of the current rows that will not be current after the new data is ingested
tmp_no_longer_current
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 0
42. SCD Type 2 Workflow with Persistent Architecture
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | California    | 2008-10-12
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 1
124        | CGC             | New York      | 2010-08-22       | 1
1. Store a temp table of the current rows that will not be current after the new data is ingested
tmp_no_longer_current
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 0
43. SCD Type 2 Workflow with Persistent Architecture
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | California    | 2008-10-12
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
124        | CGC             | New York      | 2010-08-22       | 1
2. Delete from the data lake any current rows that have a matching id in the new data
tmp_no_longer_current
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 0
44. SCD Type 2 Workflow with Persistent Architecture
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | California    | 2008-10-12
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
124        | CGC             | New York      | 2010-08-22       | 1
3. Insert the temp rows into the target table
tmp_no_longer_current
company_id | company_name    | company_state | lastmodifieddate | is_current
123        | Sterling Cooper | New York      | 2007-06-19       | 0
45. SCD Type 2 Workflow with Persistent Architecture
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | California    | 2008-10-12
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
124        | CGC             | New York      | 2010-08-22       | 1
123        | Sterling Cooper | New York      | 2007-06-19       | 0
3. Insert the temp rows into the target table
46. SCD Type 2 Workflow with Persistent Architecture
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | California    | 2008-10-12
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
124        | CGC             | New York      | 2010-08-22       | 1
123        | Sterling Cooper | New York      | 2007-06-19       | 0
3. Insert the temp rows into the target table
47. SCD Type 2 Workflow with Persistent Architecture
staging_company
company_id | company_name    | company_state | lastmodifieddate
123        | Sterling Cooper | California    | 2008-10-12
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
124        | CGC             | New York      | 2010-08-22       | 1
123        | Sterling Cooper | New York      | 2007-06-19       | 0
4. Insert the new data into the target table
48. SCD Type 2 Workflow with Persistent Architecture
staging_company
company_id | company_name    | company_state | lastmodifieddate
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
124        | CGC             | New York      | 2010-08-22       | 1
123        | Sterling Cooper | New York      | 2007-06-19       | 0
123        | Sterling Cooper | California    | 2008-10-12       | 1
4. Insert the new data into the target table
49. SCD Type 2 Workflow with Persistent Architecture
staging_company
company_id | company_name    | company_state | lastmodifieddate
target_company
company_id | company_name    | company_state | lastmodifieddate | is_current
124        | CGC             | New York      | 2010-08-22       | 1
123        | Sterling Cooper | New York      | 2007-06-19       | 0
123        | Sterling Cooper | California    | 2008-10-12       | 1
4. Insert the new data into the target table
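Expressed as queries, the four steps of the persistent approach might look like the sketch below, in generic SQL. DELETE is assumed for step 2; on an append-only store such as Treasure Data that step is typically implemented by rewriting target_company with the matching rows filtered out. Table and column names follow the slides:
-- 1. Capture current rows that are superseded by the incoming data
CREATE TABLE tmp_no_longer_current AS
SELECT t.company_id, t.company_name, t.company_state, t.lastmodifieddate, 0 AS is_current
FROM target_company t
JOIN staging_company s ON t.company_id = s.company_id
WHERE t.is_current = 1;
-- 2. Remove those rows from the target table
DELETE FROM target_company
WHERE is_current = 1
  AND company_id IN (SELECT company_id FROM staging_company);
-- 3. Re-insert the superseded rows, now flagged as historical
INSERT INTO target_company
SELECT * FROM tmp_no_longer_current;
-- 4. Insert the new data as the current rows
INSERT INTO target_company
SELECT company_id, company_name, company_state, lastmodifieddate, 1 AS is_current
FROM staging_company;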