Simple Strategies for faster knowledge discovery in big data

In today's data-driven world, the challenge is to find meaningful nuggets of information quickly and efficiently, which requires an effective infrastructure. This talk presents a few simple strategies, such as partitioning and bucketing, for making big data infrastructure efficient and effective.

* 6 continents, 70+ countries, 450 cities
* The first billion trips took 5 years; the second billion took 1.5 years
* 5+ million trips every day

Infrastructure Challenge
* Help discover knowledge
* Quickly: low query latency
* Efficiently: minimal cost

How to Build An Effective Infrastructure
* Data, e.g. log Presto queries
* Analytics, e.g. 20% of compute resources are used by queries that time out
* Optimization, e.g. build a query gate
* Forecasting, e.g. predict future requirements

Presto, Hive, HDFS, Kafka
Query attributes logged:
* User
* Tables
* Join clause
* Filter clause
* Execution-related metrics
Data Analytics Optimization Forecasting

Overview Analytics
* Most queried tables
* Most expensive queries
* Top N users
* …

Detailed Analytics
* Filter clause
* Most joined tables
* Hot & cold partitions…

[Figure: percentage of queries by statement type (select, insert, create, with, others)]
[Figure: percentage of total queries and failure percentage in each bucket, by number of joins (0 to 5+)]

90% of compute resources are consumed by the 10% most expensive queries.
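
A figure like the one above can be derived from logged per-query cost. The following is a minimal sketch, assuming each logged query carries a single cost number such as CPU-seconds; the function name `top_share` and the sample values are hypothetical and are not taken from the talk.

```python
def top_share(costs, top_percent=10):
    """Return the fraction of total cost consumed by the most expensive queries."""
    if not costs:
        return 0.0
    ranked = sorted(costs, reverse=True)
    # Size of the top slice, using integer arithmetic; at least one query.
    k = max(1, len(ranked) * top_percent // 100)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Hypothetical log: CPU-seconds per query (made-up numbers).
cpu_seconds = [900, 20, 15, 12, 10, 10, 9, 9, 8, 7]
share = top_share(cpu_seconds)
print(f"top 10% of queries use {share:.0%} of compute")  # prints 90% for this sample
```

With a real query log, the same calculation shows whether a small set of queries dominates resource usage and is therefore worth optimizing first.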

* 90% of queries using [table A] filter on [column X]
* 70% of queries using [table A] join to [table B] on column X
* 90% of queries using [table A, table B] end up using columns X and Y from table A and column Z from table B
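
Statistics of this kind can be computed from the logged filter clauses. The sketch below assumes each log record has already been parsed into the tables it touches and the columns it filters on per table; the record layout, the function name `filter_usage`, and the sample data are all hypothetical.

```python
from collections import Counter


def filter_usage(query_log, table):
    """Percent of queries touching `table` that filter on each of its columns."""
    on_table = [q for q in query_log if table in q["tables"]]
    # Count each column at most once per query.
    counts = Counter(
        col for q in on_table for col in set(q["filters"].get(table, []))
    )
    return {col: 100.0 * n / len(on_table) for col, n in counts.items()}


# Hypothetical parsed query log (made-up data).
log = [
    {"tables": ["table_a"], "filters": {"table_a": ["x"]}},
    {"tables": ["table_a", "table_b"], "filters": {"table_a": ["x", "y"]}},
    {"tables": ["table_a"], "filters": {"table_a": ["x"]}},
    {"tables": ["table_a"], "filters": {}},
    {"tables": ["table_b"], "filters": {"table_b": ["z"]}},
]
print(filter_usage(log, "table_a"))
```

Columns with a high percentage are natural candidates for partitioning; the same counting approach applies to join keys.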

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)
    ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)]
  [
    [ROW FORMAT row_format]
    [STORED AS file_format]
  ]

Optimization > Partitioned By
|--user
   |--db
      |--table
         |--date=20170101
            |--file_00000
            |--file_00001
         |--date=20170102
            |--file_00000
            |--file_00001
         |--date=20170103
            |--file_00000
            |--file_00001
SELECT *
FROM [table]
WHERE
    date = 20170102
    AND event_type = 'click'

Optimization > Partitioned By
|--user
   |--db
      |--table
         |--date=20170101
            |--event_type=click
               |--file_00000
               |--file_00001
         |--date=20170102
            |--event_type=click
               |--file_00000
               |--file_00001
            |--event_type=map
               |--file_00000
               |--file_00001
         |--….
SELECT *
FROM [table]
WHERE
    date = 20170102
    AND event_type = 'click'
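
The effect of these layouts on the query can be sketched as a count of files that must be read. This is a simplified model, not the engine's actual planner: it assumes a fixed number of files per (date, event_type) combination regardless of layout, and the names `PARTITIONS` and `files_scanned` are hypothetical.

```python
# Hypothetical data: (date, event_type) -> number of files (made-up counts).
PARTITIONS = {
    ("20170101", "click"): 2,
    ("20170102", "click"): 2,
    ("20170102", "map"): 2,
    ("20170103", "click"): 2,
}


def files_scanned(partitions, partition_keys, predicate):
    """Count files read when only predicates on partition keys can prune.

    partitions: maps (date, event_type) -> file count
    partition_keys: columns the table is partitioned by
    predicate: column -> required value from the WHERE clause
    """
    names = ("date", "event_type")
    scanned = 0
    for values, n_files in partitions.items():
        row = dict(zip(names, values))
        # A directory is skipped only if one of its partition keys
        # contradicts the filter; other columns need a full read.
        if all(row[k] == predicate[k] for k in partition_keys if k in predicate):
            scanned += n_files
    return scanned


query = {"date": "20170102", "event_type": "click"}
print(files_scanned(PARTITIONS, (), query))                      # 8: no partitioning
print(files_scanned(PARTITIONS, ("date",), query))               # 4: by date
print(files_scanned(PARTITIONS, ("date", "event_type"), query))  # 2: by date and event_type
```

Each added partition key that matches a common filter shrinks the set of directories the query has to touch.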

Optimization > Partitioned By
* A query that used to take 10 minutes now completes within 2 minutes: an 80% reduction in elapsed time
* 95% reduction in resource consumption

Partitioned By: Key Considerations
* Usage:
  * Identify fields that are often used for filtering (FILTER CLAUSE)
* Cardinality:
  * Low cardinality; otherwise the number of partitions explodes the metastore
* Skewness:
  * Data should be evenly distributed if all the keys are popular
  * Skew is acceptable if the filter value corresponds to a smaller dataset
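
The cardinality concern can be made concrete with a rough upper bound: the number of partitions is at most the product of the distinct values of each partition column. The sketch below uses made-up cardinalities, and the function name `partition_count` is hypothetical.

```python
from math import prod


def partition_count(cardinalities):
    """Upper bound on partitions: product of distinct values per partition column."""
    return prod(cardinalities.values())


# Hypothetical cardinalities (made-up numbers).
low = {"date": 365, "event_type": 50}
high = {"date": 365, "event_type": 50, "user_id": 1_000_000}

print(partition_count(low))   # 18250
print(partition_count(high))  # 18250000000
```

Adding a single high-cardinality column multiplies the partition count, and every partition is an entry the metastore has to track.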

Clustered By/Bucketing: Key Considerations
* Identify tables that are often joined together
* Cluster tables on the join key
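
Why clustering both tables on the join key helps can be sketched as follows: when both sides use the same hash and the same number of buckets, matching keys always land in the same bucket number, so bucket i of one table only needs to be joined with bucket i of the other. This is an illustration only; CRC32 stands in for the engine's own hash function, and the table and column names are hypothetical.

```python
import zlib


def bucket_of(key, num_buckets):
    """Deterministic bucket id for a join key (CRC32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, key, num_buckets):
    """Distribute rows into buckets by hashing the join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right, key, num_buckets):
    """Join bucket i of `left` only with bucket i of `right`."""
    joined = []
    left_buckets = bucketize(left, key, num_buckets)
    right_buckets = bucketize(right, key, num_buckets)
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        # Build a small lookup table for this bucket only.
        index = {}
        for right_row in right_bucket:
            index.setdefault(right_row[key], []).append(right_row)
        for left_row in left_bucket:
            for right_row in index.get(left_row[key], []):
                joined.append({**left_row, **right_row})
    return joined


# Hypothetical tables (made-up rows).
trips = [
    {"user_id": 1, "trip": "t1"},
    {"user_id": 2, "trip": "t2"},
    {"user_id": 1, "trip": "t3"},
]
users = [{"user_id": 1, "city": "SF"}, {"user_id": 2, "city": "NYC"}]

for row in sorted(bucketed_join(trips, users, "user_id", 4), key=lambda r: r["trip"]):
    print(row)
```

Each bucket pair can be processed independently, which keeps the working set of any single join task small.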

Storage
* Row vs. columnar
* Significant improvement by moving to ORC or Parquet format

SKEWED BY
* Splits data so that heavy values are stored in separate files
* Helps with query optimization
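
Candidate values for a skewed column can be found by measuring how much of the data each value accounts for. The sketch below is one simple way to do that, under the assumption that a sample of the column's values is available; the function name `heavy_values`, the 25% threshold, and the sample data are hypothetical.

```python
from collections import Counter


def heavy_values(values, threshold=0.25):
    """Return values whose share of rows exceeds `threshold`, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [v for v, n in counts.most_common() if n / total > threshold]


# Hypothetical sample of an event_type column (made-up distribution).
sample = ["click"] * 70 + ["map"] * 20 + ["search"] * 10
print(heavy_values(sample))  # ['click']
```

Values flagged this way are the ones that would be worth listing explicitly when declaring a skewed column.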
Other techniques
* Execution engine: MapReduce vs. Tez
* Vectorization
* Pre-joined tables

Key Takeaways
* Fast and efficient infrastructure is key to a business's success.
* Infrastructure optimization is an ongoing process.
* Infrastructure data science is key to building an efficient infrastructure.
* Start simple.

Ritesh Agrawal, Lead Data Scientist, Uber Inc.
Thank you

By Dr. Ritesh Agrawal (Lead Data Scientist) and Zheng Shao (Staff Software Engineer)
