3. Volumes
Engagements per month: 20M
Data per month: 13 TB
Visits per month: 1.8 B
4. Data challenges @ LP
1. ~ 13TB of data each month
2. > 1PB Hadoop cluster
3. Few clusters across the globe
4. ~ 15,000 MR jobs daily on our main cluster
5. Various heterogeneous users (R&D projects, PS, analytics, scientists and more…)
5. Data accessing challenges @ LP
1. Accessing one month of data can take a few hours (or days!)
2. Complex data model
3. PSs and analytics do not know Java (or Scala ;) )
6. Hadoop Recap
1. Created by Doug Cutting (a Yahoo employee back then) around 2005
2. HDFS - distributed, scalable, and portable file system
3. MapReduce - framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large
clusters (thousands of nodes) in a reliable, fault-tolerant manner.
4. The leading big data solution
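To make the MapReduce idea concrete, here is a minimal sketch of the map, shuffle and reduce steps as a word count. It is a plain, single-process Python simulation, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are made up for illustration, and a real job would run the same steps in parallel across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in a real job)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files stored on HDFS.
    sample = ["visit chat visit", "chat visit"]
    print(word_count(sample))  # {'visit': 3, 'chat': 2}
```

Only the mapper and reducer hold job-specific logic; splitting the input, grouping by key and retrying failed tasks is what the framework takes care of.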
9. Java Map/Reduce
Pros
➢ It has been there from the beginning
➢ Reliable
➢ Flexible
➢ Easy for Java developers
Cons
➢ You need to know Java
➢ You need to write in Java ;)
➢ Exhausting development cycle
11. Hive
Pros
➢ Common SQL-like language (HQL)
➢ Runs on the cluster: great for the trial-and-error method
Cons
➢ Declarative language (and thus limited)
➢ Each query takes time
13. Impala
1. Cloudera initiative
2. As Cloudera says: “Real-Time Queries in Apache Hadoop,
For Real”
3. Scalable parallel database technology
4. Impala is integrated with Hadoop to use the same file and
data formats, metadata, security and resource management
frameworks used by MapReduce, Apache Hive, Apache Pig
and other Hadoop software
17. What next???
➢ Security
➢ Security
➢ Security
➢ RDBMS-like capabilities on top of Hadoop
■ ACID
➢ Faster and faster access
➢ Data serialization solutions
18. THANK YOU!
We are Hiring
An extended version of this presentation will be given soon
Look for it in the IL-TechTalks group on Meetup:
http://www.meetup.com/ILTechTalks/
Ophir Cohen,
ophchu@gmail.com
@ophchu
Editor's Notes
Couple of facts about LP
Been around since 98
Doing SaaS since 98 (before somebody invented SaaS…)
8 of the top 10 Fortune companies are using LP
And a LOT of data
Complex data model
Need a few rounds of trial and error to find what you need
Crossing data is hard
Great for data scientists and PSs BUT!
Also great for me if I just want to check something - no need to write any code.
Two main problems:
1. Declarative language (and thus limited)
2. Even the trial-and-error queries take ages to execute