Storytelling: BigQuery - past, present and future
Mobile Campaign Analytics | Retargeting | Unbiased Attribution
Nir Rubinstein - Chief Architect. nir@appsflyer.com
State of the union – 2012
Tech stack:
• Small, semi-isolated python services
• Communication via Redis' pub/sub
• Main DB is CouchDB
• Not a whole lot going on...
State of the union – 2012, cont'd
The problem:
• A really big raw(!) report is being served by CouchDB
• Data in CouchDB is being generated via a View
• The entire DB is on hold while a view is generated
• Pissed off client
What do we do???
The Quick Solution
Google BigQuery
• Hosted solution – we don't have to manage it
• Based on Google's Dremel whitepaper
• Columnar storage DB
• Really easy to start working with...
The Modeling Problem
Is there one?
• Data in BigQuery is divided into:
● Projects
● Datasets
● Tables
How to split the data?
Is performance constant?
How am I charged?
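The hierarchy above maps onto fully-qualified table references of the form project.dataset.table. A minimal sketch of building such a reference — the project, dataset, and table names here are hypothetical, not AppsFlyer's:

```python
def table_ref(project: str, dataset: str, table: str) -> str:
    """Build a fully-qualified BigQuery table reference.

    BigQuery addresses data as project -> dataset -> table;
    standard SQL names a table as `project.dataset.table`
    (legacy SQL used the [project:dataset.table] form).
    """
    return f"{project}.{dataset}.{table}"

# Hypothetical names, for illustration only
print(table_ref("my-analytics", "raw_reports", "events"))
# my-analytics.raw_reports.events
```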
The Modeling Problem
The naive approach:
One Project
One Dataset
Multiple Tables
Hundreds of thousands of rows a month
The Modeling Problem
What are the limitations?
• Cannot change the schema of a table once created
• Cannot update table data
• Cannot delete table data
Essentially – forward writing only!
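With no updates or deletes, a correction has to be written as a new row, and the current state is recovered at query time by taking the latest row per key. A minimal append-only sketch — the schema and row values are our own invention, not AppsFlyer's:

```python
from operator import itemgetter

def latest_state(rows):
    """Given append-only rows of (key, timestamp, value),
    return the most recent value per key - the query-time
    equivalent of an UPDATE in a forward-writing store."""
    state = {}
    for key, ts, value in sorted(rows, key=itemgetter(1)):
        state[key] = value  # later timestamps overwrite earlier ones
    return state

# A "correction" is just another appended row:
rows = [
    ("install-1", 100, "attributed:organic"),
    ("install-2", 110, "attributed:network-a"),
    ("install-1", 120, "attributed:network-b"),  # supersedes the first row
]
print(latest_state(rows))
# {'install-1': 'attributed:network-b', 'install-2': 'attributed:network-a'}
```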
The Modeling Problem
How did we tackle these “issues”?
• One global “Project” that drives our Raw Reports
• New Datasets are created every 30 days -
business and cost limitations
• Tables in the datasets are versioned!
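The rotation scheme above can be sketched as a naming convention: the dataset name encodes the 30-day window and the table name carries a schema version. The deck doesn't spell out AppsFlyer's actual convention, so the names and epoch below are assumptions:

```python
from datetime import date

EPOCH = date(2012, 1, 1)  # assumed start of the first 30-day window

def dataset_for(day: date) -> str:
    """Map a date to its 30-day rotation window, e.g. raw_reports_0006."""
    window = (day - EPOCH).days // 30
    return f"raw_reports_{window:04d}"

def table_for(base: str, version: int) -> str:
    """Versioned table name, so a schema change becomes a new table
    rather than an in-place alteration (which BigQuery disallowed)."""
    return f"{base}_v{version}"

print(dataset_for(date(2012, 7, 1)))  # raw_reports_0006
print(table_for("events", 2))         # events_v2
```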
Current State of the Union
The Cost Problem
Is there a problem?
• Storage is very cheap, querying is expensive
• Querying is billed by the amount of data scanned.
BigQuery is a columnar DB, which means that every
column that participates in the query is read from
beginning to end
• Once tables store a lot of data, even simple queries
over very few columns become expensive
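Because billing is per byte scanned in the referenced columns, a query's cost can be estimated from the column sizes alone. A back-of-the-envelope sketch — the $5/TB rate is an assumed figure for illustration, not a quoted price:

```python
def query_cost_usd(column_bytes, usd_per_tb=5.0):
    """Estimate on-demand query cost: BigQuery charges for the full
    size of every column the query touches, regardless of LIMIT or
    filters on other columns."""
    scanned = sum(column_bytes)
    return scanned / 1e12 * usd_per_tb

# Hypothetical query touching two columns, 2 TB scanned in total
print(query_cost_usd([1.5e12, 0.5e12]))  # 10.0
```

This is also why narrow schemas and time-bounded scans matter more than row counts: a SELECT over two columns of a wide table costs the same whether you read one row or all of them.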
Cost Optimizations
• We're processing 8B daily events. Out of those, a few
hundred million are written into BigQuery – only the
meaningful data.
• We've created a unified schema to prevent table version
issues
• Queries can be cost-optimized via Table Decorators,
which limit the time range of the data scanned
• Tables are named with dates in them to support
Table Wildcard queries, which further reduce cost
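In the legacy SQL of the era, a range table decorator bounds the scan by ingestion time, and TABLE_DATE_RANGE expands a set of date-suffixed tables. A sketch of building both query forms — the dataset and table names are hypothetical:

```python
def decorated_ref(dataset: str, table: str, start_ms: int, end_ms: int) -> str:
    """Legacy-SQL range table decorator: restricts the scan to data
    added between start_ms and end_ms (epoch milliseconds)."""
    return f"[{dataset}.{table}@{start_ms}-{end_ms}]"

def date_range_query(dataset: str, prefix: str, start: str, end: str) -> str:
    """Legacy-SQL wildcard over date-suffixed tables
    (e.g. events20120601, events20120602, ...)."""
    return (
        "SELECT COUNT(*) FROM "
        f"TABLE_DATE_RANGE([{dataset}.{prefix}], "
        f"TIMESTAMP('{start}'), TIMESTAMP('{end}'))"
    )

print(decorated_ref("raw_reports", "events", 0, 3600000))
# [raw_reports.events@0-3600000]
print(date_range_query("raw_reports", "events", "2012-06-01", "2012-06-07"))
```

Either way the engine only scans the decorated window or the matched tables, instead of every row ever written.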
The “Cost” of Cost Optimizations
• Performance (querying multiple tables)
• Over engineering (inserting to and maintaining multiple
tables)
• Storage is cheap, but querying is costly. Since querying
does a full column scan, there's a debate whether we
should store the entire data or parts of it.
The Future
What are we waiting for?
Custom partitioning functions!!!
Thank You!
(We're hiring)