Impala turbocharge your big data access

Ophir Cohen

Impala - is that the next analytics tool?

Data & Analytics

Impala - Turbocharge Your Big Data Access
Ophir Cohen,
Data Platform Group Leader,
ophirc@liveperson.com
Jan 2014

LivePerson Is...
Mission: Creating Meaningful Customer Connections
Customers: 8,500 customers
Technology: SaaS pioneer since 1998

Volumes
13 TB per month
20M Engagements per month
1.8 B Visits per month

Data challenges @ LP
1. ~ 13TB of data each month
2. > 1PB Hadoop cluster
3. Few clusters across the globe
4. ~ 15,000 MR jobs daily on our main cluster
5. Various heterogeneous users (R&D projects, PS, analytics, scientists
and more…)
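
To put these volumes in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: the 30-day month and the even spread of jobs across the day are assumptions, not figures from the slides.

```python
# Back-of-the-envelope rates derived from the figures on this slide.
# Assumptions (NOT from the slides): a 30-day month, 1 TB = 1000 GB,
# and load spread evenly over the day.
TB_PER_MONTH = 13
MR_JOBS_PER_DAY = 15_000
DAYS_PER_MONTH = 30          # assumption
MINUTES_PER_DAY = 24 * 60


def daily_ingest_gb(tb_per_month: float = TB_PER_MONTH) -> float:
    """Approximate amount of new data per day, in GB."""
    return tb_per_month * 1000 / DAYS_PER_MONTH


def jobs_per_minute(jobs_per_day: int = MR_JOBS_PER_DAY) -> float:
    """Approximate number of MapReduce jobs started per minute."""
    return jobs_per_day / MINUTES_PER_DAY


if __name__ == "__main__":
    print(f"~{daily_ingest_gb():.0f} GB of new data per day")
    print(f"~{jobs_per_minute():.1f} MapReduce jobs per minute")
```

Under those assumptions the main cluster takes in several hundred GB a day and starts roughly ten MapReduce jobs every minute.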

Data accessing challenges @ LP
1. Querying one month of data can take a few hours (or days!)
2. Complex data model
3. PSs and analysts do not know Java (or Scala ;) )

Hadoop Recap
1. Created by Doug Cutting and Mike Cafarella around 2005 (Cutting joined Yahoo in 2006)
2. HDFS - distributed, scalable, and portable file system
3. MapReduce - framework for easily writing applications that process
vast amounts of data (multi-terabyte data sets) in parallel on large
clusters (thousands of nodes) in a reliable, fault-tolerant manner.
4. The leading big data solution
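
The MapReduce model in item 3 can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps, not Hadoop code; the function names and the word-count task are chosen here for illustration only.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["impala hive impala", "hive hadoop"]))
```

On a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data over the network; the framework also reruns failed tasks, which is where the fault tolerance comes from.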

Stone Age: Java Map/Reduce
Pros
➢ It has been there from the beginning
➢ Reliable
➢ Flexible
➢ Easy for Java developers
Cons
➢ You need to know Java
➢ You need to write in Java ;)
➢ Exhausting development cycle

Bronze Age: Hive
Pros
➢ Common SQL-like language (HQL)
➢ Runs on the cluster - great for a trial-and-error approach
Cons
➢ Declarative language (and thus limited)
➢ Each query takes time
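
To illustrate what "SQL-like" and "declarative" mean in practice, the sketch below runs a GROUP BY query from Python. Note the assumptions: Python's built-in sqlite3 module is used purely as a stand-in so the example runs anywhere (it is not Hive), and the `visits` table with its `site` and `visitor_id` columns is hypothetical. A query of this general shape is the kind of thing HQL expresses without any Java.

```python
import sqlite3

# Declarative query: it states WHAT is wanted (visits per site),
# not HOW to compute it. Table and column names are hypothetical.
QUERY = """
    SELECT site, COUNT(*) AS visit_count
    FROM visits
    GROUP BY site
    ORDER BY site
"""


def visits_per_site(rows: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Load (site, visitor_id) rows into an in-memory table and aggregate."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, NOT Hive
    try:
        conn.execute("CREATE TABLE visits (site TEXT, visitor_id TEXT)")
        conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("a.com", "v1"), ("b.com", "v2"), ("a.com", "v3")]
    print(visits_per_site(sample))
```

The same aggregation written as a hand-coded Map/Reduce job needs a mapper, a reducer and a build step, which is why a query language suits users who do not write Java.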

Iron Age: Impala
1. Cloudera initiative
2. As Cloudera says: “Real-Time Queries in Apache Hadoop,
For Real”
3. Scalable, massively parallel (MPP) database technology
4. Impala is integrated with Hadoop to use the same file and
data formats, metadata, security and resource management
frameworks used by MapReduce, Apache Hive, Apache Pig
and other Hadoop software

Impala
✓ Uses the Hive interface (HQL)
➢ No new education needed
➢ Little or no rewriting of Hive queries needed
✓ 4x to 30x faster than Hive
➢ Trial and error works!
✓ Bypasses Map/Reduce
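
As a rough illustration of what a 4x to 30x speedup means for trial-and-error work, the sketch below converts a Hive runtime into an expected Impala range. The speedup bounds come from the slide; the two-hour Hive baseline is a hypothetical figure chosen for the example.

```python
def impala_runtime_range(hive_minutes: float,
                         min_speedup: float = 4,
                         max_speedup: float = 30) -> tuple[float, float]:
    """Return (best_case, worst_case) runtime in minutes for a given Hive runtime."""
    return hive_minutes / max_speedup, hive_minutes / min_speedup


if __name__ == "__main__":
    # Hypothetical baseline: a query that takes two hours in Hive.
    best, worst = impala_runtime_range(120)
    print(f"Expected Impala runtime: {best:.0f} to {worst:.0f} minutes")
```

At that rate several iterations of a query fit into the time a single Hive run would take.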

Modern History: What next???
➢ Security
➢ Security
➢ Security
➢ RDBMS-like capabilities on top of Hadoop
■ ACID
➢ Faster and faster access
➢ Data serialization solutions

THANK YOU!
We are Hiring
An extended version of this presentation will be given soon
Look for it in the IL-TechTalks group on Meetup:
http://www.meetup.com/ILTechTalks/
Ophir Cohen,
ophchu@gmail.com
@ophchu

Editor's Notes

A couple of facts about LP: been around since '98; doing SaaS since '98 (before somebody invented SaaS…); 8 of the top 10 Fortune companies are using LP; and a LOT of data.
Complex data model: you need a few rounds of trial and error to find what you need. Crossing data is hard.
Great for data scientists and PSs. BUT! Also great for me if I just want to check something: no need to write any code. Two main problems: 1. Declarative language (and thus limited) 2. Even the trial-and-error queries take ages to execute.
MPP (Massively Parallel Processing)
What do you think are the next steps?
