From HDFS to Impala - Turbocharge your big data access

Speaker: Ophir Cohen, Data Platform Leader, LivePerson (@ophchu)
In the presentation I'll discuss the evaluation of technologies for accessing big data repositories.
The main technologies I'm reviewing are Java MapReduce, Hive, Pig and Impala, as examples of the generations of Hadoop access technologies.
Video Credits: DevCon:

Published in: Technology

  • At LP we say ‘Connection before content’
    Enthusiastic about new technologies, with a preference for open source
  • A couple of facts about LP
    Been around since '98
    Doing SaaS since '98 (before somebody invented SaaS…)
    8 of the top 10 Fortune companies use LP
    And a LOT of data
  • Complex data model
    It takes some trial and error to find what you need
    Crossing data is hard
  • Great for data scientists and PSs, BUT!
    Also great for me if I just want to check something: no need to write any code.
    Two main problems:
    1. Declarative language (and thus limited)
    2. Even the trial-and-error queries take ages to execute
  • MPP (Massively Parallel Processing)
  • What do you think are the next steps?
  • Transcript

    • 1. Impala - Turbocharge Your Big Data Access Ophir Cohen, Data Platform Group Leader, Jan 2014
    • 2. Connection Before Content --> What is my age? --> How many children do I have? --> What is my favorite sport? Also: • Past: Co-Founder at Collarity, a user-to-content matching and relevancy engine • A Big Data expert • Technology enthusiast with a preference for open source
    • 3. LivePerson Is... 8,500 customers. Creating Meaningful Customer Connections. SaaS pioneer since 1998. Mission, Customers, Technology
    • 4. Volumes: 13 TB per month, 20M engagements per month, 1.8B visits per month
    • 5. Data challenges @ LP 1. ~ 13TB of data each month 2. > 1PB Hadoop cluster 3. A few clusters across the globe 4. ~ 15,000 MR jobs daily on our main cluster 5. Various heterogeneous users (R&D projects, PS, analytics, scientists and more…)
    • 6. Data accessing challenges @ LP 1. Processing one month of data can take a few hours (or days!) 2. Complex data model 3. PSs and analysts do not know Java (or Scala ;) )
    • 7. Hadoop Recap 1. Created by Doug Cutting (Yahoo employee back then) around 2005 2. HDFS - distributed, scalable, and portable file system 3. MapReduce - framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner. 4. The leading big data solution
    • 8. Accessing Hadoop
    • 9. Stone Age Java Map/Reduce
    • 10. Java Map/Reduce Pros ➢ It has been there from the beginning ➢ Reliable ➢ Flexible ➢ Easy for Java developers Cons ➢ You need to know Java ➢ You need to write in Java ;) ➢ Exhausting development cycle
    • 11. Bronze Age Hive
    • 12. Hive Pros ➢ Common SQL-like language (HQL) ➢ Running on the cluster - great for trial-and-error method Cons ➢ Declarative language (and thus limited) ➢ Each query takes time
    • 13. Iron Age Impala
    • 14. Impala 1. Cloudera initiative 2. As Cloudera says: “Real-Time Queries in Apache Hadoop, For Real” 3. scalable parallel database technology 4. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software
    • 15. Impala
    • 16. Impala ✓ Uses Hive interface (HQL) ➢ No retraining needed ➢ No (or little) rewriting of Hive queries needed ✓ 4 to 30x faster than Hive ➢ Trial and error works! ✓ Bypasses MapReduce
    • 17. Modern History What next?
    • 18. What next??? ➢ RDBMS-like capabilities on top of Hadoop ■ ACID ➢ Faster and faster access ➢ Security ➢ Security ➢ Security ➢ Data serialization solutions
    • 19. THANK YOU! We are Hiring. Ophir Cohen, @ophchu. An extended version of this presentation will be given soon. Look for it in the IL-TeckTalks group on Meetup:
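
The contrast the slides draw between hand-written MapReduce (slide 10) and a SQL-like query language (slide 12) can be sketched in a few lines. This is only an illustration under stated assumptions: the `VISITS` sample data and all function names are hypothetical, the map/shuffle/reduce phases run in a single Python process rather than on a Hadoop cluster, and SQLite stands in for Hive or Impala purely to show the declarative form of the same aggregation.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (visitor_id, page) pairs standing in for visit logs.
VISITS = [("v1", "home"), ("v2", "home"), ("v1", "pricing"), ("v3", "home")]


def map_phase(records):
    """Map: emit a (page, 1) pair for every record."""
    for _visitor, page in records:
        yield page, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_with_mapreduce(records):
    """Count visits per page by chaining the three phases by hand."""
    return reduce_phase(shuffle_phase(map_phase(records)))


def count_with_sql(records):
    """Same aggregation as one declarative query (SQLite is a stand-in, not Hive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (visitor_id TEXT, page TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT page, COUNT(*) FROM visits GROUP BY page"
    ).fetchall()
    conn.close()
    return dict(rows)
```

Both functions return the same per-page counts for `VISITS`. The first spells out every phase in code, while the second states only the desired result, which is roughly the trade-off the Java Map/Reduce and Hive slides describe.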