The Meta of Hadoop - COMAD 2012
What do you talk about to a hall full of database gurus? Instead of science, my talk focused on the art. What made Hadoop successful? What can we learn from it? What principles work well in building software for large-scale services? What are some interesting unsolved problems in a world overrun by open source (and VC investments :-))?


Comments
  • @vshreepadma - did your customers' run-time environment benefit (from an efficiency point of view) from collecting stats? If so - I am glad it worked out, and clearly I need to come up to speed. I tend to pick the easiest, biggest-bang projects to go after first. There are many runtime improvements to Hadoop and Hive that I think are still not done that could make dramatic improvements in speed and usability. (Many of these need to be in Hadoop land - which is complicated because of hadoop-1/2.)

    As an example - one common stats-related problem we see is the case of zero-cookies (null userids). This is a common cause of skew. A pointed effort to identify this pattern would help a lot of internet companies, and it hardly requires much difficult stats gathering.

    One of the things I tried to convey in the talk is to take a service-oriented mindset. For a company like ours providing a service - we also want to collect stats (not so much for efficiency right now - but just for helping users understand datasets) - and we would schedule this as low-priority work in the background (potentially using cheaper compute resources). So I would encourage taking a holistic approach. Could the next-generation Hive Server be used to control such background activities by default - with close coupling to the queuing systems in Hadoop (Fair/Capacity Scheduler) to minimize the impact on regular runtime? But still do it automatically.

    Constraints bring out the best in us. Could capturing stats on samples, in an automatic and transparent manner, be a good middle ground (if the end goal is to optimize the run-time)?
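    The zero-cookie problem described here - every null userid hashing to the same reducer - has a standard mitigation: salt the null keys so they fan out across reducers. A minimal sketch, assuming a hypothetical partition function (this is not Hive's actual implementation):

    ```python
    import random
    from collections import defaultdict

    def partition(key, n_reducers, n_salts=8):
        """Route a record to a reducer. Without salting, every null
        userid ("zero-cookie") would hash to the same reducer."""
        if key is None:
            # Hypothetical fix: tag nulls with a random salt so they
            # spread over up to n_salts reducers instead of one.
            key = ("__null__", random.randrange(n_salts))
        return hash(key) % n_reducers

    # Simulate a skewed log: 90% of events carry no userid.
    events = [None] * 9000 + [f"user{i}" for i in range(1000)]
    load = defaultdict(int)
    for k in events:
        load[partition(k, n_reducers=16)] += 1

    # With salting, no single reducer receives the bulk of the records.
    assert max(load.values()) < len(events) // 2
    ```

    A real engine would re-merge the salted groups afterwards (or simply drop null keys from the join); the point is that detecting this one pattern removes most of the skew.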
  • Thanks for the clarification. Recently, we contributed column-level statistics gathering to Hive. Here at Cloudera, we experimented on data sets that closely resemble those of some of our customers and noticed that some of the stats traditionally gathered on columns, such as NDVs and equi-height histograms, are computationally expensive. Computing these stats while loading/scanning the data would significantly impact the performance of the load/scan task. While I agree that it's good to build systems that require little tuning and support, there are cases such as the one outlined above where automating the task is not desirable for the sake of performance. I'm wondering what your take on it is.
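    The NDV statistic called expensive here can be approximated in a single pass with a k-minimum-values (KMV) sketch - one possible middle ground between full stats gathering and none. A sketch under illustrative parameters (not what Hive actually uses):

    ```python
    import hashlib
    import heapq

    def kmv_ndv(values, k=256):
        """Estimate the number of distinct values from the k smallest
        normalized hashes seen during one scan of the data."""
        hashes = set()
        for v in values:
            h = int.from_bytes(
                hashlib.md5(str(v).encode()).digest()[:8], "big"
            ) / 2**64  # pseudo-uniform in [0, 1)
            hashes.add(h)
        smallest = heapq.nsmallest(k, hashes)
        if len(smallest) < k:
            return len(smallest)            # fewer than k distinct values: exact
        return int((k - 1) / smallest[-1])  # classic KMV estimator

    data = [i % 1000 for i in range(100_000)]  # 1000 distinct values
    est = kmv_ndv(data)
    assert abs(est - 1000) / 1000 < 0.3        # close, at a fraction of the cost
    ```

    The sketch needs only k hashes of state per column, so it could plausibly ride along with a load/scan task without the cost of an exact distinct count.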
  • I also replied to another Hive contributor about the stats comment:

    My perspective is limited to a period of about a year or so when people tried to implement stats in Hive and deploy it at FB. It made no difference. Stats collection was not automated, nor were stats collected all the time. Within that timeframe - in spite of significant engineering effort - no impact was made on the runtime efficiency of queries. For me - this was a failure.

    I am not sure where things stand today - and while we run Hive as part of our service - I don't think we are collecting stats. A meta-point I was trying to make was that in a world where a lot of data is scanned only once, stats are an optimization. Adaptive strategies that discover near-optimal plans in the first pass are easier to manage. Similarly - query plans that cause support issues because the stats are out of date (for example) are also very bad. We need software that's really easy to support. Hence the title of the slide (adaptive lights-out software).
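    The "adaptive strategies that discover near-optimal plans in the first pass" can be illustrated with a join that picks its strategy from observed input sizes at run time, rather than from pre-collected (and possibly stale) statistics - roughly the idea behind Hive's automatic map-join. A hypothetical sketch (for brevity, both branches execute a hash join here; a real engine would shuffle and sort in the fallback):

    ```python
    def adaptive_join(left, right, key, memory_budget=10_000):
        """Pick a join strategy by looking at the actual inputs at
        execution time, with no reliance on catalog statistics."""
        small, big = sorted((left, right), key=len)
        if len(small) <= memory_budget:
            strategy = "map-join"         # broadcast the small side
        else:
            strategy = "sort-merge-join"  # a real engine would shuffle/sort

        # Demo simplification: always hash-join on the smaller side.
        index = {}
        for row in small:
            index.setdefault(row[key], []).append(row)
        pairs = [(s, b) for b in big for s in index.get(b[key], [])]
        return strategy, pairs

    users = [{"id": i} for i in range(3)]
    clicks = [{"id": i % 3} for i in range(100)]
    strategy, pairs = adaptive_join(users, clicks, "id")
    assert strategy == "map-join" and len(pairs) == 100
    ```

    Because the decision happens on the first pass over the real data, it cannot be wrong in the way a stale-statistics plan can.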
  • Hi Carl - apologies - there was a lot of verbal context to this talk. The point I tried to make to the audience was that the collection of statistics needed to be automatic and transparent. In a world where there was little explicit data management - if things were not automated, they failed. The initial revisions of statistics gathering required (AFAIK) explicit commands to gather stats. It also didn't gather stats as data was scanned (all the time). The project was completely useless.

    The perspective is not whether the software was written successfully or not - but whether it achieved the end goal.
  • You list statistics as a failed project but give no reason for this verdict. I'm really curious to know why you think this project didn't work out.
  • Transcript

    • 1. The Meta of Hadoop - Joydeep Sen Sarma, ex-Facebook DI Lead, Founder Qubole
    • 2. Intro
      • File/Database Systems developer (ex-Netapp/Oracle)
      • Yahoo (2005-07), Facebook (2007-11)
      • @Facebook:
        – SysAdmin: operated massive Hadoop/Hive installs
        – Architect: conceived/wrote Apache Hive; made HBase@FB happen
        – Herded cats: first manager of Data Infra team
        – IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency
      • Founder, Qubole Inc. (2011-)
    • 3. Why Hadoop Succeeded
      • Complete Solution and Extensible
        – Useful to Engineers, Data Scientists, Analysts
        – Performance isn't everything
        – Agile - businesses move much faster than before
      • Market Dynamics
        – Captive super-reference customer - Yahoo
        – Had the early market to itself for a long time
      • Separation of Compute and Storage
        – Parallel Computing != Database
    • 4. Why Hadoop Succeeded
      • Data Consolidation!
        – Just store everything in HDFS
        – MR/Hive/Pig can chew anything
      • Lights-Out Architecture
        – Low System Operational Cost
        – Low Data Management Cost
        – Don't need Data Priests
    • 5. Meta Takeaways
    • 6. Adaptive Lights-Out Software
      • Successful efforts:
        – Automatic map-join/skew-join implementations
        – Automatic local mode, resource cache
      • Failed:
        – Statistics: alter table / analyze table
        – Pre-bucketing tables
      • Learning Frameworks for Systems Software
    • 7. Adaptive Lights-Out Software
      • Caching + Prefetching is Adaptive
        – Replication is not
        – Can bridge the gap between Compute and Storage
      • Page Cache over Disk >> In-memory
        – Degrades gracefully
      • Provide APIs - not packages
    • 8. Murphy's Law
      • No Trusted Components
      • Defend everything
        – Rate-limit access to every resource
        – Log and Monitor everything
      • Clear and Overwhelming Force
        – Oversize it!
      • Think QoS from Day 1
    • 9. Open Source
      • Small is Beautiful
        – Build small, easy-to-use/understand components
        – Redis!
      • Iterative Small Changes
        – Operators HATE large releases
        – Hive (2 weeks) vs. Hadoop (2 years?)
    • 10. Opportunities
    • 11. Interesting Problems - I
      • Collaborative Analysis
        – Most analysis is repeated
        – Tracking and searching historical analyses
      • Consistency-Aware Querying
        – OLAP: snapshots instead of live tables
        – OLTP: look up stale caches instead of the master
    • 12. Interesting Problems - II
      • SQL is Rope
        – Better than procedural - but still rope
        – Higher-level templates: moving averages
      • Data = Mutating + Immutable
        – Immutable data is easy to manage
        – Cheap: one copy per data center (Facebook Haystack)
    • 13. Think Services, not Software
      • Software is getting less interesting
        – Even Distributed Systems Software
      • Run/operate long-running, hot services
        – Innovate inside this boundary
    • 14. Q&A
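
    Slide 8's advice to rate-limit access to every resource is usually realized with a token bucket per protected resource. A minimal sketch (the class and parameters are illustrative, not from the talk):

    ```python
    import time

    class TokenBucket:
        """One bucket per resource: tokens refill at a steady rate and
        requests are allowed only while tokens remain."""
        def __init__(self, rate, burst):
            self.rate, self.capacity = rate, burst
            self.tokens, self.last = burst, time.monotonic()

        def allow(self, cost=1.0):
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False

    bucket = TokenBucket(rate=10, burst=5)  # 10 requests/s, bursts of 5
    granted = sum(bucket.allow() for _ in range(100))
    assert 5 <= granted <= 7  # the burst, plus any refill during the loop
    ```

    Because the bucket degrades to a steady drip under overload rather than failing outright, it fits the same "degrades gracefully" theme as slide 7.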