From HDFS to Impala - Turbocharge your big data access



Speaker: Ophir Cohen, Data Platform Leader, LivePerson (@ophchu)
In this presentation I'll discuss the evaluation of technologies for accessing big data repositories.
The main technologies I review are Java MapReduce, Hive, Pig and Impala, as examples of the successive generations of Hadoop access technologies.
Video Credits: DevCon:

Published in: Technology
  • At LP we say ‘Connection before content’
    Enthusiastic about new technologies, with a preference for open source
  • Couple of facts about LP
    Been around since '98
    Doing SaaS since '98 (before somebody invented SaaS…)
    8 of the top 10 Fortune companies use LP
    And a LOT of data
  • Complex data model
    It takes some trial and error to find what you need
    Crossing data is hard
  • Great for data scientists and PSs, BUT!
    Also great for me if I just want to check something - no need to write any code.
    Two main problems:
    1. Declarative language (and therefore limited)
    2. Even the trial-and-error queries take ages to execute
  • MPP (Massively Parallel Processing)
  • What do you think are the next steps?

    1. Impala - Turbocharge Your Big Data Access. Ophir Cohen, Data Platform Group Leader, Jan 2014
    2. Connection Before Content --> What is my age? --> How many children do I have? --> What is my favorite sport? Also: • Past: Co-Founder at Collarity, a user-to-content matching and relevancy engine • A Big Data expert • Technology enthusiast with a preference for open source
    3. LivePerson Is... 8,500 customers; Creating Meaningful Customer Connections; SaaS pioneer since 1998. Mission, Customers, Technology
    4. Volumes: 13 TB per month; 20M engagements per month; 1.8B visits per month
    5. Data challenges @ LP: 1. ~13 TB of data each month 2. > 1 PB Hadoop cluster 3. A few clusters across the globe 4. ~15,000 MR jobs daily on our main cluster 5. Various heterogeneous users (RND projects, PS, analytics, scientists and more…)
    6. Data accessing challenges @ LP: 1. Accessing one month of data can take a few hours (or days!) 2. Complex data model 3. PSs and analysts do not know Java (or Scala ;) )
    7. Hadoop Recap: 1. Created by Doug Cutting (Yahoo employee back then) around 2005 2. HDFS - distributed, scalable, and portable file system 3. MapReduce - framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner 4. The leading big data solution
    8. Accessing Hadoop
    9. Stone Age: Java Map/Reduce
    10. Java Map/Reduce. Pros: ➢ It has been there from the beginning ➢ Reliable ➢ Flexible ➢ Easy for Java developers. Cons: ➢ You need to know Java ➢ You need to write in Java ;) ➢ Exhausting development cycle
    11. Bronze Age: Hive
    12. Hive. Pros: ➢ Common SQL-like language (HQL) ➢ Runs on the cluster - great for the trial-and-error method. Cons: ➢ Declarative language (and therefore limited) ➢ Each query takes time
    13. Iron Age: Impala
    14. Impala: 1. Cloudera initiative 2. As Cloudera says: “Real-Time Queries in Apache Hadoop, For Real” 3. Scalable parallel database technology 4. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software
    15. Impala
    16. Impala: ✓ Uses the Hive interface (HQL) ➢ No new education needed ➢ Little or no rewriting of Hive queries needed ✓ 4 to 30x faster than Hive ➢ Trial and error works! ✓ Bypasses map/reduce
    17. Modern History: What next?
    18. What next??? ➢ RDBMS-like on top of Hadoop ■ ACID ➢ Faster and faster access ➢ Security ➢ Security ➢ Security ➢ Data serialization solutions
    19. THANK YOU! We are Hiring. Ophir Cohen, @ophchu. An extended version of this presentation will be given soon. Look for it on IL-TeckTalks group on Meetup:
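
The slides contrast hand-written Map/Reduce jobs with SQL-like queries (HQL) but show neither. As a rough illustration only, the sketch below runs the same aggregation two ways in plain Python: once as explicit map, shuffle and reduce steps, and once as a declarative GROUP BY query. Everything in it is an assumption made for illustration: the sample records, the function names, and the use of Python's built-in SQLite as a stand-in for Hive or Impala. It does not use Hadoop and is not taken from the talk.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample records (visitor_id, country); not data from the talk.
VISITS = [
    ("v1", "US"),
    ("v2", "IL"),
    ("v3", "US"),
    ("v4", "DE"),
    ("v5", "IL"),
    ("v6", "US"),
]


def map_phase(records):
    """Emit one (country, 1) pair per record, as a mapper would."""
    for _visitor_id, country in records:
        yield country, 1


def shuffle_phase(pairs):
    """Group values by key, mimicking the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def count_with_mapreduce(records):
    """Imperative style: the author wires the three phases together."""
    return reduce_phase(shuffle_phase(map_phase(records)))


def count_with_sql(records):
    """Declarative style: one query (SQLite stands in for an HQL engine)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE visits (visitor_id TEXT, country TEXT)")
        conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
        rows = conn.execute(
            "SELECT country, COUNT(*) FROM visits GROUP BY country"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(count_with_mapreduce(VISITS))
    print(count_with_sql(VISITS))
```

Both functions return the same per-country counts. The first needs three hand-written phases; the second is a single query, which is the convenience the Hive and Impala slides point to for users who do not write Java.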