From HDFS to Impala - Turbocharge your big data access

Speaker: Ophir Cohen, Data Platform Leader, LivePerson (@ophchu). In this presentation I discuss the evolution of technologies for accessing big data repositories. The main technologies I review are Java MapReduce, Hive, Pig and Impala, as examples of the generations of Hadoop access technologies. Video Credits: DevCon: http://devcon-jan14.events.co.il/tracks

Impala - Turbocharge Your Big Data Access
Ophir Cohen,
Data Platform Group Leader,
ophirc@liveperson.com
Jan 2014

Connection Before Content
--> What is my age?
--> How many children do I have?
--> What is my favorite sport?
Also:
• Past: Co-Founder at Collarity, a user-to-content matching
and relevancy engine
• A Big Data expert
• Technology enthusiast with a preference for open source

LivePerson Is...
• Mission: Creating Meaningful Customer Connections
• Customers: 8,500 customers
• Technology: SaaS pioneer since 1998

Volumes
• 13 TB of data per month
• 20M engagements per month
• 1.8B visits per month

Data challenges @ LP
1. ~13 TB of data each month
2. > 1 PB Hadoop cluster
3. A few clusters across the globe
4. ~15,000 MR jobs daily on our main cluster
5. Various heterogeneous users (R&D projects, PS, analytics, scientists
and more…)
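As a rough back-of-envelope reading of the figures above, the monthly volume and daily job count can be turned into average per-day and per-minute rates. This is only a sketch: the 30-day month and decimal units (1 TB = 1000 GB) are assumptions the slide does not state, and the constant names are illustrative.

```python
# Back-of-envelope rates from the slide's figures.
# Assumptions (not from the slides): 30-day month, 1 TB = 1000 GB.
TB_PER_MONTH = 13
MR_JOBS_PER_DAY = 15_000
DAYS_PER_MONTH = 30

gb_per_day = TB_PER_MONTH * 1000 / DAYS_PER_MONTH
jobs_per_minute = MR_JOBS_PER_DAY / (24 * 60)

print(f"~{gb_per_day:.0f} GB of new data per day on average")
print(f"~{jobs_per_minute:.1f} MapReduce jobs per minute on average")
```

Under these assumptions the main cluster takes in a few hundred gigabytes a day and runs roughly ten jobs a minute around the clock.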

Data accessing challenges @ LP
1. Querying one month of data can take a few hours (or days!)
2. Complex data model
3. PS and analytics people do not know Java (or Scala ;) )

Hadoop Recap
1. Created by Doug Cutting (a Yahoo employee back then) around 2005
2. HDFS - a distributed, scalable, and portable file system
3. MapReduce - a framework for easily writing applications that process
vast amounts of data (multi-terabyte data sets) in parallel on large
clusters (thousands of nodes) in a reliable, fault-tolerant manner.
4. The leading big data solution
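To make the MapReduce model in the recap concrete, here is a minimal single-process sketch of its three steps (map, shuffle, reduce). It is not Hadoop code and does not distribute anything; the record layout and function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical input records: (visitor_id, page). Not LivePerson's data model.
records = [("v1", "home"), ("v2", "home"), ("v1", "pricing"), ("v3", "home")]

def map_phase(record):
    """Emit one (key, 1) pair per record; the key here is the page."""
    _visitor, page = record
    yield page, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

mapped = [pair for record in records for pair in map_phase(record)]
visits_per_page = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(visits_per_page)
```

On a real cluster the map and reduce calls run on many nodes and the shuffle moves data over the network; the programmer still writes only the two functions.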

Java Map/Reduce
Pros
➢ It has been there from the beginning
➢ Reliable
➢ Flexible
➢ Easy for Java developers
Cons
➢ You need to know Java
➢ You need to write in Java ;)
➢ Exhausting development cycle

Hive
Pros
➢ Common SQL-like language (HQL)
➢ Runs on the cluster - great for the trial-and-error method
Cons
➢ Declarative language (and thus limited)
➢ Each query takes time
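To illustrate what a SQL-like, declarative interface buys you, the sketch below computes a per-page visit count with a single query instead of hand-written job code. Python's built-in sqlite3 is used purely as a local stand-in so the example runs anywhere; it is not Hive, and the table and column names are hypothetical. A HiveQL query for the same question would look very similar.

```python
import sqlite3

# sqlite3 is only a local stand-in here; it is NOT Hive.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("v1", "home"), ("v2", "home"), ("v1", "pricing"), ("v3", "home")],
)

# The whole "job" is one declarative statement: say what you want, not how.
query = """
    SELECT page, COUNT(*) AS visits
    FROM visits
    GROUP BY page
    ORDER BY visits DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The trade-off named on the slide still applies: anything that does not fit the query language needs custom code, and on Hive each such query is compiled into MapReduce jobs, so even small experiments take time.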

Impala
1. Cloudera initiative
2. As Cloudera says: “Real-Time Queries in Apache Hadoop,
For Real”
3. Scalable parallel (MPP) database technology
4. Impala is integrated with Hadoop to use the same file and
data formats, metadata, security and resource management
frameworks used by MapReduce, Apache Hive, Apache Pig
and other Hadoop software

Impala
✓ Uses the Hive interface (HQL)
➢ No new training needed
➢ Little or no rewriting of Hive queries needed
✓ 4 to 30x faster than Hive
➢ Trial and error works!
✓ Bypasses MapReduce
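To put the 4 to 30x range in perspective, the short calculation below applies it to a single query. The 60-minute Hive baseline is a hypothetical figure chosen for illustration, not a measurement from the talk.

```python
# What a 4x to 30x speedup means for one query.
# The 60-minute Hive baseline is a hypothetical figure, not a measurement.
HIVE_MINUTES = 60.0
SPEEDUP_LOW, SPEEDUP_HIGH = 4, 30

slowest_impala = HIVE_MINUTES / SPEEDUP_LOW    # least favourable case
fastest_impala = HIVE_MINUTES / SPEEDUP_HIGH   # most favourable case

print(f"Hive: {HIVE_MINUTES:.0f} min -> Impala: "
      f"{fastest_impala:.0f} to {slowest_impala:.0f} min")
```

At that scale a query that used to be a once-an-hour experiment can be repeated several times in the same hour, which is what makes trial and error practical.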

What next???
➢ RDBMS-like capabilities on top of Hadoop
■ ACID
➢ Faster and faster access
➢ Security
➢ Security
➢ Security
➢ Data serialization solutions

THANK YOU!
We are Hiring
Ophir Cohen,
ophchu@gmail.com
@ophchu
An extended version of this presentation will be given soon.
Look for it in the IL-TechTalks group on Meetup:
http://www.meetup.com/ILTechTalks/

Section slides (by slide number)
8. Accessing Hadoop
9. Stone Age: Java Map/Reduce
11. Bronze Age: Hive
13. Iron Age: Impala
17. Modern History: What next?

Editor's Notes

At LP we say ‘Connection before content’. Enthusiastic about new technologies, with a preference for open source.
A couple of facts about LP: been around since '98; doing SaaS since '98 (before somebody invented SaaS…); 8 of the top 10 Fortune companies use LP; and a LOT of data.
Complex data model. It takes a few rounds of trial and error to find what you need. Crossing data is hard.
Great for data scientists and PSs. BUT! Also great for me if I just want to check something - no need to write any code. Two main problems: 1. Declarative language (and thus limited) 2. Even trial-and-error queries take ages to execute
MPP (Massively Parallel Processing)
What do you think are the next steps?
