SlideShare a Scribd company logo
1 of 25
Download to read offline
Simple Strategies For
Faster Knowledge
Discovery In Big Data
Dr. Ritesh Agrawal
Lead Data Scientist
Zheng Shao
Staff Software Engineer
* 6 continents, 70+ countries, 450 cities
* 1st billionth trip took 5 years. 2nd billionth trip took 1.5 years
* 5+ million trips every day
Data-Driven Culture
Infrastructure Challenge
*Help discover
*Quickly: having low query latency
*Efficiently: minimal cost
Data

eg: log presto queries
Analytics
eg: 20% compute
resources used by timeout
queries
How to Build An Effective Infrastructure
Optimization

eg: build query gate
Forecasting

eg: predict future 

requirements
PRESTO HIVE
HDFS
KAFKA
*User
*Tables
*Join Clause
*Filter Clause
*Execution Related
Metrics
Data Analytics Optimization Forecasting
*Most queried tables

*Most expensive queries

*Top N users

*…

Overview Analytics
*Filter Clause

*Most Joined tables

*Hot & Cold Partitions…

Detailed Analytics
Data Analytics Optimization Forecasting
Data Analytics Optimization Forecasting
others
with
create
select
insert
PercentQueries
0
20
40
60
80
Number of Joins
0 1 2 3 4 5+
Pct of Total Queries Failure Pct in Bucket
90% of compute resources
is consumed by 10% most
expensive queries
* 90% of queries using [table A] filter on [column x]
* 70% of queries using [table A] join to [table B] on
column X
* 90% of queries using [table A, Table B] end using
Column X, Y from table A and Column Z from table B
Data Analytics Optimization Forecasting
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[ PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[
CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS
]
[
SKEWED BY (col_name, col_name, ...)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
]
[
[ROW FORMAT row_format]
[STORED AS file_format]
]
Data Analytics Optimization Forecasting
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[ PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[
CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS
]
[
SKEWED BY (col_name, col_name, ...)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
]
[
[ROW FORMAT row_format]
[STORED AS file_format]
]
Data Analytics Optimization Forecasting
Optimization > Partitioned By
|—user
|—db
|— table
|—date=20170101
|—file_00000
|—file_00000
|—date=20170102
|—file_00000
|—file_00000
|—date=20170103
|—file_00000
|—file_00000
SELECT *
FROM [table]
WHERE
date = 20170102
AND event_type = ‘click’
Optimization > Partitioned By
|—user
|—db
|— table
|—date=20170101
|—file_00000
|—file_00001
|—date=20170102
|—file_00000
|—file_00001
|—date=20170103
|—file_00000
|—file_00001
SELECT *
FROM [table]
WHERE
date = 20170102
AND event_type = ‘click’
Optimization > Partitioned By
|—user
|—db
|— table
|—date=20170101
|—event_type=‘click’
|—file_00000
|—file_00001
|—date=20170102
|—event_type=‘click’
|—file_00000
|—file_00001
|—event_type=‘map’
|—file_00000
|—file_00001
|—….
SELECT *
FROM [table]
WHERE
date = 20170102
AND event_type = ‘click’
Optimization > Partitioned By
*A query that used to take 10 minutes now completes within 2 minutes. 80% reduction
in elapsed time
*95% reduction in resource consumption.
Partitioned By: Key Considerations
* Usage: 

* Identify fields that are often used for filtering (FILTER CLAUSE)

* Cardinality: 

* Low Cardinality — otherwise explodes meta-store

* Skewness:

* data should evenly distributed if all the keys are popular

* skewness is okie if filter value corresponds to smaller dataset
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[ PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[
CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS
]
[
SKEWED BY (col_name, col_name, ...)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
]
[
[ROW FORMAT row_format]
[STORED AS file_format]
]
Data Analytics Optimization Forecasting
Optimization > Clustered By/Bucketing
Optimization > Clustered By/Bucketing
* Identify tables that are often joined together. 

* Cluster tables on join key.
Clustered By/Bucketing: Key Considerations
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[ PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[
CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS
]
[
SKEWED BY (col_name, col_name, ...)
ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
]
[
[ROW FORMAT row_format]
[STORED AS file_format]
]
Data Analytics Optimization Forecasting
StorageSKEWED BY
* Row vs Col.

* Significant improvement
by moving to ORC or
parquet format. 

* splits data so that heavy
values are stored in
separate files

* Helps with query
optimization.
Data Analytics Optimization Forecasting
Other techniques
* execution engine:
MapReduce Vs Tez

* Vectorization

* Pre-joined tables
Key Takeaways
! Fast and efficient infrastructure is key to a business’s success.
! Infrastructure Optimization is a constantly ongoing process.
! Infrastructure Data Science is key to building an efficient infrastructure
! Start simple
First & Last Name
Ritesh Agrawal,
Lead Data Scientist, Uber Inc.
Thank you

More Related Content

What's hot

Html table
Html tableHtml table
Html tableJayjZens
 
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...NoSQLmatters
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorceryelliando dias
 
A Tour to MySQL Commands
A Tour to MySQL CommandsA Tour to MySQL Commands
A Tour to MySQL CommandsHikmat Dhamee
 
Tables and Forms in HTML
Tables and Forms in HTMLTables and Forms in HTML
Tables and Forms in HTMLMarlon Jamera
 
Some Examples in R- [Data Visualization--R graphics]
 Some Examples in R- [Data Visualization--R graphics] Some Examples in R- [Data Visualization--R graphics]
Some Examples in R- [Data Visualization--R graphics]Dr. Volkan OBAN
 
Sql queries - Basics
Sql queries - BasicsSql queries - Basics
Sql queries - BasicsPurvik Rana
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesJulian Hyde
 
Introduction to SQL (for Chicago Booth MBA technology club)
Introduction to SQL (for Chicago Booth MBA technology club)Introduction to SQL (for Chicago Booth MBA technology club)
Introduction to SQL (for Chicago Booth MBA technology club)Jennifer Berk
 
Web design - Working with tables in HTML
Web design - Working with tables in HTMLWeb design - Working with tables in HTML
Web design - Working with tables in HTMLMustafa Kamel Mohammadi
 
Hadoop Summit EU 2014
Hadoop Summit EU   2014Hadoop Summit EU   2014
Hadoop Summit EU 2014cwensel
 
Internal DSLs Scala
Internal DSLs ScalaInternal DSLs Scala
Internal DSLs Scalazefhemel
 
Indexing and Query Optimizer (Richard Kreuter)
Indexing and Query Optimizer (Richard Kreuter)Indexing and Query Optimizer (Richard Kreuter)
Indexing and Query Optimizer (Richard Kreuter)MongoDB
 

What's hot (18)

Sql
SqlSql
Sql
 
Html table
Html tableHtml table
Html table
 
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
Stefan Hochdörfer - The NoSQL Store everyone ignores: PostgreSQL - NoSQL matt...
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorcery
 
A Tour to MySQL Commands
A Tour to MySQL CommandsA Tour to MySQL Commands
A Tour to MySQL Commands
 
Tables and Forms in HTML
Tables and Forms in HTMLTables and Forms in HTML
Tables and Forms in HTML
 
Some Examples in R- [Data Visualization--R graphics]
 Some Examples in R- [Data Visualization--R graphics] Some Examples in R- [Data Visualization--R graphics]
Some Examples in R- [Data Visualization--R graphics]
 
SQLSlide
SQLSlideSQLSlide
SQLSlide
 
HTML: Tables and Forms
HTML: Tables and FormsHTML: Tables and Forms
HTML: Tables and Forms
 
Sql queries - Basics
Sql queries - BasicsSql queries - Basics
Sql queries - Basics
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databases
 
Introduction to SQL (for Chicago Booth MBA technology club)
Introduction to SQL (for Chicago Booth MBA technology club)Introduction to SQL (for Chicago Booth MBA technology club)
Introduction to SQL (for Chicago Booth MBA technology club)
 
Web design - Working with tables in HTML
Web design - Working with tables in HTMLWeb design - Working with tables in HTML
Web design - Working with tables in HTML
 
Hadoop Summit EU 2014
Hadoop Summit EU   2014Hadoop Summit EU   2014
Hadoop Summit EU 2014
 
R workshop
R workshopR workshop
R workshop
 
Internal DSLs Scala
Internal DSLs ScalaInternal DSLs Scala
Internal DSLs Scala
 
Indexing and Query Optimizer (Richard Kreuter)
Indexing and Query Optimizer (Richard Kreuter)Indexing and Query Optimizer (Richard Kreuter)
Indexing and Query Optimizer (Richard Kreuter)
 
Html Table
Html TableHtml Table
Html Table
 

Similar to Simple Strategies for faster knowledge discovery in big data

Introducción rápida a SQL
Introducción rápida a SQLIntroducción rápida a SQL
Introducción rápida a SQLCarlos Hernando
 
Unit_III_SQL-MySQL-Commands-Basic.pptx usefull
Unit_III_SQL-MySQL-Commands-Basic.pptx  usefullUnit_III_SQL-MySQL-Commands-Basic.pptx  usefull
Unit_III_SQL-MySQL-Commands-Basic.pptx usefullANTOARA2211003040050
 
Lecture05sql 110406195130-phpapp02
Lecture05sql 110406195130-phpapp02Lecture05sql 110406195130-phpapp02
Lecture05sql 110406195130-phpapp02Lalit009kumar
 
СУБД осень 2012 Лекция 3
СУБД осень 2012 Лекция 3СУБД осень 2012 Лекция 3
СУБД осень 2012 Лекция 3Technopark
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSigmoid
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)Michael Rys
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineeringJulian Hyde
 
Database queries
Database queriesDatabase queries
Database querieslaiba29012
 
Unidad 4 actividad 1
Unidad 4 actividad 1Unidad 4 actividad 1
Unidad 4 actividad 1KARY
 
Introducing ms sql_server_updated
Introducing ms sql_server_updatedIntroducing ms sql_server_updated
Introducing ms sql_server_updatedleetinhf
 
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...DataStax
 
Les09 (using ddl statements to create and manage tables)
Les09 (using ddl statements to create and manage tables)Les09 (using ddl statements to create and manage tables)
Les09 (using ddl statements to create and manage tables)Achmad Solichin
 
Database Systems - SQL - DDL Statements (Chapter 3/2)
Database Systems - SQL - DDL Statements (Chapter 3/2)Database Systems - SQL - DDL Statements (Chapter 3/2)
Database Systems - SQL - DDL Statements (Chapter 3/2)Vidyasagar Mundroy
 

Similar to Simple Strategies for faster knowledge discovery in big data (20)

Introducción rápida a SQL
Introducción rápida a SQLIntroducción rápida a SQL
Introducción rápida a SQL
 
Apache TAJO
Apache TAJOApache TAJO
Apache TAJO
 
Unit_III_SQL-MySQL-Commands-Basic.pptx usefull
Unit_III_SQL-MySQL-Commands-Basic.pptx  usefullUnit_III_SQL-MySQL-Commands-Basic.pptx  usefull
Unit_III_SQL-MySQL-Commands-Basic.pptx usefull
 
Lecture05sql 110406195130-phpapp02
Lecture05sql 110406195130-phpapp02Lecture05sql 110406195130-phpapp02
Lecture05sql 110406195130-phpapp02
 
SQL-MySQL-Commands-Basic.pptx
SQL-MySQL-Commands-Basic.pptxSQL-MySQL-Commands-Basic.pptx
SQL-MySQL-Commands-Basic.pptx
 
СУБД осень 2012 Лекция 3
СУБД осень 2012 Лекция 3СУБД осень 2012 Лекция 3
СУБД осень 2012 Лекция 3
 
Spark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. JyotiskaSpark Dataframe - Mr. Jyotiska
Spark Dataframe - Mr. Jyotiska
 
U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)U-SQL Partitioned Data and Tables (SQLBits 2016)
U-SQL Partitioned Data and Tables (SQLBits 2016)
 
Database
Database Database
Database
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineering
 
Database queries
Database queriesDatabase queries
Database queries
 
Unidad 4 actividad 1
Unidad 4 actividad 1Unidad 4 actividad 1
Unidad 4 actividad 1
 
Dbms sql-final
Dbms  sql-finalDbms  sql-final
Dbms sql-final
 
Access 04
Access 04Access 04
Access 04
 
Introducing ms sql_server_updated
Introducing ms sql_server_updatedIntroducing ms sql_server_updated
Introducing ms sql_server_updated
 
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...
 
Les09
Les09Les09
Les09
 
Les09 (using ddl statements to create and manage tables)
Les09 (using ddl statements to create and manage tables)Les09 (using ddl statements to create and manage tables)
Les09 (using ddl statements to create and manage tables)
 
Database Systems - SQL - DDL Statements (Chapter 3/2)
Database Systems - SQL - DDL Statements (Chapter 3/2)Database Systems - SQL - DDL Statements (Chapter 3/2)
Database Systems - SQL - DDL Statements (Chapter 3/2)
 
Html Tags
Html TagsHtml Tags
Html Tags
 

Recently uploaded

Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 

Recently uploaded (20)

Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 

Simple Strategies for faster knowledge discovery in big data

  • 1. Simple Strategies For Faster Knowledge Discovery In Big Data Dr. Ritesh Agrawal Lead Data Scientist Zheng Shao Staff Software Engineer
  • 2. * 6 continents, 70+ countries, 450 cities * 1st billionth trip took 5 years. 2nd billionth trip took 1.5 years * 5+ million trips every day
  • 4. Infrastructure Challenge *Help discover *Quickly: having low query latency *Efficiently: minimal cost
  • 5. Data
 eg: log presto queries Analytics eg: 20% compute resources used by timeout queries How to Build An Effective Infrastructure Optimization
 eg: build query gate Forecasting
 eg: predict future 
 requirements
  • 6. PRESTO HIVE HDFS KAFKA *User *Tables *Join Clause *Filter Clause *Execution Related Metrics Data Analytics Optimization Forecasting
  • 7. *Most queried tables *Most expensive queries *Top N users *… Overview Analytics *Filter Clause *Most Joined tables *Hot & Cold Partitions… Detailed Analytics Data Analytics Optimization Forecasting
  • 8. Data Analytics Optimization Forecasting others with create select insert PercentQueries 0 20 40 60 80 Number of Joins 0 1 2 3 4 5+ Pct of Total Queries Failure Pct in Bucket
  • 9. 90% of compute resources is consumed by 10% most expensive queries
  • 10. * 90% of queries using [table A] filter on [column x] * 70% of queries using [table A] join to [table B] on column X * 90% of queries using [table A, Table B] end using Column X, Y from table A and Column Z from table B Data Analytics Optimization Forecasting
  • 11. CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENT col_comment], ... [constraint_specification])] [ PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [ CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS ] [ SKEWED BY (col_name, col_name, ...) ON ((col_value, col_value, ...), (col_value, col_value, ...), ...) ] [ [ROW FORMAT row_format] [STORED AS file_format] ] Data Analytics Optimization Forecasting
  • 12. CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENT col_comment], ... [constraint_specification])] [ PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [ CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS ] [ SKEWED BY (col_name, col_name, ...) ON ((col_value, col_value, ...), (col_value, col_value, ...), ...) ] [ [ROW FORMAT row_format] [STORED AS file_format] ] Data Analytics Optimization Forecasting
  • 13. Optimization > Partitioned By |—user |—db |— table |—date=20170101 |—file_00000 |—file_00000 |—date=20170102 |—file_00000 |—file_00000 |—date=20170103 |—file_00000 |—file_00000 SELECT * FROM [table] WHERE date = 20170102 AND event_type = ‘click’
  • 14. Optimization > Partitioned By |—user |—db |— table |—date=20170101 |—file_00000 |—file_00001 |—date=20170102 |—file_00000 |—file_00001 |—date=20170103 |—file_00000 |—file_00001 SELECT * FROM [table] WHERE date = 20170102 AND event_type = ‘click’
  • 15. Optimization > Partitioned By |—user |—db |— table |—date=20170101 |—event_type=‘click’ |—file_00000 |—file_00001 |—date=20170102 |—event_type=‘click’ |—file_00000 |—file_00001 |—event_type=‘map’ |—file_00000 |—file_00001 |—…. SELECT * FROM [table] WHERE date = 20170102 AND event_type = ‘click’
  • 16. Optimization > Partitioned By *A query that used to take 10 minutes now completes within 2 minutes. 80% reduction in elapsed time *95% reduction in resource consumption.
  • 17. Partitioned By: Key Considerations * Usage: * Identify fields that are often used for filtering (FILTER CLAUSE) * Cardinality: * Low Cardinality — otherwise explodes meta-store * Skewness: * data should evenly distributed if all the keys are popular * skewness is okie if filter value corresponds to smaller dataset
  • 18. CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENT col_comment], ... [constraint_specification])] [ PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [ CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS ] [ SKEWED BY (col_name, col_name, ...) ON ((col_value, col_value, ...), (col_value, col_value, ...), ...) ] [ [ROW FORMAT row_format] [STORED AS file_format] ] Data Analytics Optimization Forecasting
  • 19. Optimization > Clustered By/Bucketing
  • 20. Optimization > Clustered By/Bucketing
  • 21. * Identify tables that are often joined together. * Cluster tables on join key. Clustered By/Bucketing: Key Considerations
  • 22. CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENT col_comment], ... [constraint_specification])] [ PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [ CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS ] [ SKEWED BY (col_name, col_name, ...) ON ((col_value, col_value, ...), (col_value, col_value, ...), ...) ] [ [ROW FORMAT row_format] [STORED AS file_format] ] Data Analytics Optimization Forecasting
  • 23. StorageSKEWED BY * Row vs Col. * Significant improvement by moving to ORC or parquet format. * splits data so that heavy values are stored in separate files * Helps with query optimization. Data Analytics Optimization Forecasting Other techniques * execution engine: MapReduce Vs Tez * Vectorization * Pre-joined tables
  • 24. Key Takeaways ! Fast and efficient infrastructure is key to a business’s success. ! Infrastructure Optimization is a constantly ongoing process. ! Infrastructure Data Science is key to building an efficient infrastructure ! Start simple
  • 25. First & Last Name Ritesh Agrawal, Lead Data Scientist, Uber Inc. Thank you