Speaker: Ophir Cohen, Data Platform Leader, LivePerson (@ophchu)
In this presentation I'll discuss the evaluation of technologies for accessing big data repositories.
The main technologies I review are Java MapReduce, Hive, Pig and Impala, as examples of the generations of Hadoop access technologies.
Video Credits: DevCon: http://devcon-jan14.events.co.il/tracks
From HDFS to Impala - Turbocharge your big data access
1. Impala - Turbocharge Your Big Data Access
Ophir Cohen,
Data Platform Group Leader,
ophirc@liveperson.com
Jan 2014
2. Connection Before Content
--> What is my age?
--> How many children do I have?
--> What is my favorite sport?
Also:
• Past: Co-Founder at Collarity, a user-to-content matching
and relevancy engine
• A Big Data expert
• Technology enthusiast with a preference for open source
4. Volumes
13 TB per month
20M engagements per month
1.8B visits per month
5. Data challenges @ LP
1. ~ 13TB of data each month
2. > 1PB Hadoop cluster
3. A few clusters across the globe
4. ~ 15,000 MR jobs daily on our main cluster
5. Various heterogeneous users (R&D projects, PS, analytics, scientists
and more…)
6. Data accessing challenges @ LP
1. Processing one month of data can take a few hours (or days!)
2. Complex data model
3. PS and analytics people do not know Java (or Scala ;) )
7. Hadoop Recap
1. Created by Doug Cutting (a Yahoo employee back then) around 2005
2. HDFS - distributed, scalable, and portable file system
3. MapReduce - framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large
clusters (thousands of nodes) in a reliable, fault-tolerant manner.
4. The leading big data solution
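The MapReduce model from the recap above can be sketched in a few lines of plain Python. This is only an in-process illustration of the map, shuffle and reduce phases using word count; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (the framework's job on a real cluster)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

On a real cluster the map and reduce functions run in parallel on many nodes and the framework handles the shuffle, distribution and fault tolerance; here everything runs in a single process.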
10. Java Map/Reduce
Pros
➢ It has been there from the beginning
➢ Reliable
➢ Flexible
➢ Easy for Java developers
Cons
➢ You need to know Java
➢ You need to write in Java ;)
➢ Exhausting development cycle
12. Hive
Pros
➢ Common SQL-like language (HQL)
➢ Runs on the cluster - great for trial and error
Cons
➢ Declarative language (and thus limited)
➢ Each query takes time
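To give a feel for what "SQL-like" buys you, here is a minimal sketch of a declarative aggregation. It uses Python's built-in `sqlite3` as a stand-in, not Hive itself, and the `visits` table, its columns and the `visits_per_site` helper are hypothetical. A similar `SELECT ... GROUP BY` statement is the kind of query one would write in HQL instead of hand-coding a map and a reduce function.

```python
import sqlite3
from typing import Iterable, List, Tuple


def visits_per_site(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count visits per site with a single declarative query.

    sqlite3 is only a local stand-in here; the table and column
    names are invented for the illustration.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE visits (site TEXT, day TEXT)")
        conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
        query = (
            "SELECT site, COUNT(*) AS visit_count "
            "FROM visits GROUP BY site ORDER BY site"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        ("a.com", "2014-01-01"),
        ("b.com", "2014-01-01"),
        ("a.com", "2014-01-02"),
    ]
    print(visits_per_site(sample))
```

The query states what result is wanted and leaves the execution plan to the engine, which is both the convenience and the limitation noted in the cons above.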
14. Impala
1. Cloudera initiative
2. As Cloudera says: “Real-Time Queries in Apache Hadoop,
For Real”
3. Scalable parallel database technology
4. Impala is integrated with Hadoop to use the same file and
data formats, metadata, security and resource management
frameworks used by MapReduce, Apache Hive, Apache Pig
and other Hadoop software
18. What next???
➢ RDBMS-like capabilities on top of Hadoop
■ ACID
➢ Faster and faster access
➢ Security
➢ Security
➢ Security
➢ Data serialization solutions
19. THANK YOU!
We are Hiring
Ophir Cohen,
ophchu@gmail.com
@ophchu
An extended version of this presentation will be given soon
Look for it in the IL-TechTalks group on Meetup:
http://www.meetup.com/ILTechTalks/
Editor's Notes
At LP we say ‘Connection before content’
Enthusiastic about new technologies, with a preference for open source
A couple of facts about LP
Been around since 98
Doing SaaS since 98 (before somebody invented SaaS…)
8 of the top 10 Fortune companies are using LP
And a LOT of data
Complex data model
Need a few rounds of trial and error to find what you need
Crossing data is hard
Great for data scientists and PS, BUT!
Also great for me if I just want to check something - no need to write any code.
Two main problems:
1. Declarative language (and thus limited)
2. Even the trial-and-error queries take ages to execute