From HDFS to Impala - Turbocharge your big data access

Speaker: Ophir Cohen, Data Platform Leader, LivePerson (@ophchu). In this presentation I discuss the evaluation of technologies for accessing big data repositories. The main technologies I review are Java MapReduce, Hive, Pig and Impala, as examples of the generations of Hadoop access technologies. Video credits: DevCon: http://devcon-jan14.events.co.il/tracks

Impala - Turbocharge Your Big Data Access
Ophir Cohen,
Data Platform Group Leader,
ophirc@liveperson.com
Jan 2014

Connection Before Content
--> What is my age?
--> How many children do I have?
--> What is my favorite sport?
Also:
• Past: Co-Founder at Collarity, a users-to-content matching
and relevancy engine
• A Big Data expert
• Technology enthusiast with a preference for open source

LivePerson Is...
8,500 customers
Creating Meaningful
Customer Connections
SaaS pioneer since 1998
Mission
Customers
Technology

Volumes
• 13 TB per month
• 20M engagements per month
• 1.8B visits per month

Data challenges @ LP
1. ~ 13TB of data each month
2. > 1PB Hadoop cluster
3. A few clusters across the globe
4. ~ 15,000 MR jobs daily on our main cluster
5. Various heterogeneous users (R&D projects, PS, analytics, scientists
and more…)
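
To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It assumes a 30-day month, evenly spread load, and 1 TB = 1000 GB; none of these are stated in the slides, so treat the results as orders of magnitude only.

```python
# Back-of-the-envelope rates derived from the slide's figures.
# Assumptions (not from the slides): 30-day month, evenly spread load,
# 1 TB = 1000 GB.
TB_PER_MONTH = 13
MR_JOBS_PER_DAY = 15_000
DAYS_PER_MONTH = 30

gb_per_day = TB_PER_MONTH * 1000 / DAYS_PER_MONTH
jobs_per_minute = MR_JOBS_PER_DAY / (24 * 60)

print(f"~{gb_per_day:.0f} GB of new data per day")
print(f"~{jobs_per_minute:.1f} MR jobs started per minute")
```

Under these assumptions the main cluster absorbs a few hundred gigabytes a day while starting roughly ten MapReduce jobs every minute.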

Data accessing challenges @ LP
1. Processing one month of data can take a few hours (or days!)
2. Complex data model
3. PS and analytics people do not know Java (or Scala ;) )

Hadoop Recap
1. Created by Doug Cutting (a Yahoo employee back then) around 2005
2. HDFS - distributed, scalable, and portable file system
3. MapReduce - framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large
clusters (thousands of nodes) in a reliable, fault-tolerant manner.
4. The leading big data solution
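
The MapReduce model described above can be sketched in a few lines. This is a single-process illustration of the map, shuffle and reduce steps only, not Hadoop's API; the function names and the sample records are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group emitted values by key (the framework does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values seen for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map -> shuffle -> reduce over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input records, standing in for lines of a file on HDFS.
records = ["chat started", "chat ended", "visit started"]
counts = run_job(records)
print(counts)
```

On a real cluster the mappers and reducers run in parallel on many nodes and the shuffle moves data across the network; the sketch keeps only the programming model.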

Java Map/Reduce
Pros
➢ It has been there from the beginning
➢ Reliable
➢ Flexible
➢ Easy for Java developers
Cons
➢ You need to know Java
➢ You need to write in Java ;)
➢ Exhausting development cycle

Hive
Pros
➢ Common SQL-like language (HQL)
➢ Runs on the cluster - great for trial and error
Cons
➢ Declarative language (and thus limited)
➢ Each query takes time
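
To illustrate what "SQL-like" buys you, here is the kind of declarative aggregate an analyst might write instead of a MapReduce job. As an assumption for the sake of a runnable example, Python's built-in sqlite3 stands in for Hive; the table name, columns and rows are hypothetical, and a real HQL statement would run against tables defined in the Hive metastore.

```python
import sqlite3

# An in-memory SQLite database stands in for Hive here (assumption for
# illustration only). Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE engagements (visitor_id TEXT, channel TEXT)")
conn.executemany(
    "INSERT INTO engagements VALUES (?, ?)",
    [("v1", "chat"), ("v2", "chat"), ("v3", "voice")],
)

# One declarative statement replaces a hand-written mapper and reducer.
query = """
    SELECT channel, COUNT(*) AS total
    FROM engagements
    GROUP BY channel
    ORDER BY channel
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The query says what to compute and leaves the how to the engine, which is both the appeal and the limitation listed above.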

Impala
1. Cloudera initiative
2. As Cloudera says: “Real-Time Queries in Apache Hadoop,
For Real”
3. Scalable parallel database technology
4. Impala is integrated with Hadoop to use the same file and
data formats, metadata, security and resource management
frameworks used by MapReduce, Apache Hive, Apache Pig
and other Hadoop software

Impala
✓ Uses Hive interface (HQL)
➢ No retraining needed
➢ Little or no rewriting of Hive queries needed
✓ 4x to 30x faster than Hive
➢ Trial and error works!
✓ Bypass map/reduce
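
A quick worked example of what the 4x to 30x range means for trial and error. The two-hour Hive baseline is an assumed figure chosen for illustration, not a measurement from the talk.

```python
# What a 4x-30x speedup means for one exploratory query.
# HIVE_MINUTES is an assumed baseline, for illustration only.
HIVE_MINUTES = 120
SPEEDUP_LOW, SPEEDUP_HIGH = 4, 30

slowest_minutes = HIVE_MINUTES / SPEEDUP_LOW
fastest_minutes = HIVE_MINUTES / SPEEDUP_HIGH

print(f"Hive: {HIVE_MINUTES} min per attempt")
print(f"Impala: between {fastest_minutes:.0f} and {slowest_minutes:.0f} min per attempt")
```

With that baseline, a query that allowed only a handful of attempts per working day becomes one you can iterate on many times.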

What next???
➢ RDBMS-like capabilities on top of Hadoop
■ ACID
➢ Faster and faster access
➢ Security
➢ Security
➢ Security
➢ Data serialization solutions

THANK YOU!
We are Hiring
Ophir Cohen,
ophchu@gmail.com
@ophchu
An extended version of this presentation will be given soon.
Look for it in the IL-TechTalks group on Meetup:
http://www.meetup.com/ILTechTalks/

Section title slides
8. Accessing Hadoop
9. Stone Age: Java Map/Reduce
11. Bronze Age: Hive
13. Iron Age: Impala
17. Modern History: What next?

Editor's Notes

At LP we say 'Connection before content'. Enthusiastic about new technologies, with a preference for open source.
A couple of facts about LP: been around since '98; doing SaaS since '98 (before somebody invented SaaS…); 8 of the top 10 Fortune companies use LP; and a LOT of data.
Complex data model. You need a few rounds of trial and error to find what you need. Crossing data is hard.
Great for data scientists and PS, BUT also great for me if I just want to check something: no need to write any code. Two main problems: 1. Declarative language (and thus limited) 2. Even the trial-and-error queries take ages to execute
MPP (Massively Parallel Processing)
What do you think are the next steps?