Impala - Turbocharge Your Big Data Access
Ophir Cohen,
Data Platform Group Leader,
ophirc@liveperson.com
Jan 2014
Connection Before Content
--> What is my age?
--> How many children do I have?
--> What is my favorite sport?
Also:
• Past...
LivePerson Is...
8,500 customers
Creating Meaningful
Customer Connections
SaaS pioneer since 1998
Mission
Customers
Technol...
Volumes
13 TB per month
20M engagements per month
1.8 B visits per month
Data challenges @ LP
1. ~13 TB of data each month
2. > 1 PB Hadoop cluster
3. A few clusters across the globe
4. ~ 15,000 MR ...
Data accessing challenges @ LP
1. Processing one month of data can take a few hours (or days!)
2. Complex data model
3. PSs and analytics does ...
Hadoop Recap
1. Created by Doug Cutting (a Yahoo employee back then) around 2005
2. HDFS - distributed, scalable, and port...
Accessing Hadoop
Stone Age
Java Map/Reduce
Java Map/Reduce
Pros
➢ It has been there from the beginning
➢ Reliable
➢ Flexible
➢ Easy for Java developers
Cons
➢ You ne...
Bronze Age
Hive
Hive
Pros
➢ Common SQL-like language (HQL)
➢ Runs on the cluster - great for trial and error
Cons
➢ Declarative ...
Iron Age
Impala
Impala
1. Cloudera initiative
2. As Cloudera says: “Real-Time Queries in Apache Hadoop, For Real”
3. scalable parallel dat...
Impala
Impala
✓ Uses Hive interface (HQL)
➢ No new learning needed
➢ No (or only minor) rewriting of Hive queries needed
✓ 4 to 30 X faste...
Modern History
What next?
What next???
➢ RDBMS-like on top of Hadoop
■ ACID
➢ Faster and faster access
➢ Security
➢ Security
➢ Security
➢ Data seria...
THANK YOU!
We are Hiring
Ophir Cohen,
ophchu@gmail.com
@ophchu
An extended version of this presentation will be given soon
Lo...

From HDFS to Impala - Turbocharge your big data access

Speaker: Ophir Cohen, Data Platform Leader, LivePerson (@ophchu)
In this presentation I'll discuss the evaluation of technologies for accessing big data repositories.
The main technologies I'm reviewing are Java MapReduce, Hive, Pig and Impala, as examples of the generations of Hadoop access technologies.
Video Credits: DevCon: http://devcon-jan14.events.co.il/tracks

Published in: Technology
  • At LP we say ‘Connection before content’
    Enthusiastic about new technologies, with a preference for open source
  • A couple of facts about LP
    Been around since 1998
    Doing SaaS since 1998 (before somebody invented SaaS…)
    8 of the top 10 Fortune companies use LP
    And a LOT of data
  • Complex data model
    It takes some trial and error to find what you need
    Crossing data is hard
  • Great for data scientists and PSs BUT!
    Also great for me if I just want to check something - no need to write any code.
    Two main problems:
    1. Declarative language (and thus limited)
    2. Even trial-and-error queries take ages to execute
  • MPP (Massively Parallel Processing)
  • What do you think the next steps are?
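
The Hive note above stresses that a declarative, SQL-like language lets you check something without writing a job. As a rough illustration of that style only, the sketch below runs a grouped count through Python's built-in sqlite3 module, used here purely as a stand-in: it is not Hive or Impala, and the `visits` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (customer_id, page). Not LivePerson's real data model.
SAMPLE_ROWS = [
    ("acme", "/home"),
    ("acme", "/pricing"),
    ("globex", "/home"),
    ("acme", "/home"),
    ("globex", "/contact"),
]


def visits_per_customer(rows):
    """Count visits per customer with one declarative query.

    sqlite3 is only a local stand-in for a SQL-like engine such as Hive.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE visits (customer_id TEXT, page TEXT)")
        conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
        # The query states WHAT is wanted; the engine decides HOW to compute it.
        cursor = conn.execute(
            "SELECT customer_id, COUNT(*) AS visit_count "
            "FROM visits GROUP BY customer_id ORDER BY customer_id"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for customer_id, visit_count in visits_per_customer(SAMPLE_ROWS):
        print(customer_id, visit_count)
```

The whole aggregation is a single statement, which is what makes quick trial and error practical once the engine answers fast enough.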
  • Transcript of "From HDFS to Impala - Turbocharge your big data access"

    1. Impala - Turbocharge Your Big Data Access
       Ophir Cohen, Data Platform Group Leader, ophirc@liveperson.com
       Jan 2014
    2. Connection Before Content
       --> What is my age?
       --> How many children do I have?
       --> What is my favorite sport?
       Also:
       • Past: Co-founder at Collarity, a relevancy engine that matches users to content
       • A Big Data expert
       • Technology enthusiast with a preference for open source
    3. LivePerson Is...
       8,500 customers
       Creating Meaningful Customer Connections
       SaaS pioneer since 1998
       Mission, Customers, Technology
    4. Volumes
       13 TB per month
       20M engagements per month
       1.8 B visits per month
    5. Data challenges @ LP
       1. ~13 TB of data each month
       2. > 1 PB Hadoop cluster
       3. A few clusters across the globe
       4. ~15,000 MR jobs daily on our main cluster
       5. Various heterogeneous users (R&D projects, PS, analytics, scientists and more…)
    6. Data accessing challenges @ LP
       1. Processing one month of data can take a few hours (or days!)
       2. Complex data model
       3. PSs and analysts do not know Java (or Scala ;) )
    7. Hadoop Recap
       1. Created by Doug Cutting (a Yahoo employee back then) around 2005
       2. HDFS - a distributed, scalable, and portable file system
       3. MapReduce - a framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner
       4. The leading big data solution
    8. Accessing Hadoop
    9. Stone Age: Java Map/Reduce
    10. Java Map/Reduce
        Pros
        ➢ It has been there from the beginning
        ➢ Reliable
        ➢ Flexible
        ➢ Easy for Java developers
        Cons
        ➢ You need to know Java
        ➢ You need to write in Java ;)
        ➢ Exhausting development cycle
    11. Bronze Age: Hive
    12. Hive
        Pros
        ➢ Common SQL-like language (HQL)
        ➢ Runs on the cluster - great for trial and error
        Cons
        ➢ Declarative language (and thus limited)
        ➢ Each query takes time
    13. Iron Age: Impala
    14. Impala
        1. Cloudera initiative
        2. As Cloudera says: “Real-Time Queries in Apache Hadoop, For Real”
        3. Scalable parallel database technology
        4. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software
    15. Impala
    16. Impala
        ✓ Uses Hive interface (HQL)
          ➢ No new learning needed
          ➢ No (or only minor) rewriting of Hive queries needed
        ✓ 4 to 30x faster than Hive
          ➢ Trial and error works!
        ✓ Bypasses map/reduce
    17. Modern History: What next?
    18. What next???
        ➢ RDBMS-like on top of Hadoop
          ■ ACID
        ➢ Faster and faster access
        ➢ Security
        ➢ Security
        ➢ Security
        ➢ Data serialization solutions
    19. THANK YOU!
        We are Hiring
        Ophir Cohen, ophchu@gmail.com, @ophchu
        An extended version of this presentation will be given soon
        Look for it in the IL-TechTalks group on Meetup: http://www.meetup.com/ILTechTalks/
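
The Hadoop recap in the transcript describes MapReduce as a framework for processing data in parallel. As a minimal, single-process sketch of the programming model only (it does not use Hadoop, and real jobs would be written against the Hadoop API, typically in Java), the following shows the map, shuffle, and reduce phases counting visits per customer. The record layout and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical visit records: (customer_id, page). Not LivePerson's real data model.
VISITS = [
    ("acme", "/home"),
    ("acme", "/pricing"),
    ("globex", "/home"),
    ("acme", "/home"),
    ("globex", "/contact"),
]


def map_phase(records):
    """Emit a (key, 1) pair for every input record, as a mapper would."""
    for customer_id, _page in records:
        yield customer_id, 1


def shuffle_phase(pairs):
    """Group emitted values by key; the framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def count_visits(records):
    """Chain the three phases in one process to show the data flow."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(count_visits(VISITS))
```

Even this toy version needs three functions for a single count, which hints at why the deck calls the Java Map/Reduce development cycle exhausting compared with a one-line SQL-like query.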
