Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving


Published on

Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. However, that requires scalable, responsive tools that support exploration and discovery of captured content. Here you'll learn about why Warcbase, an open-source platform for managing web archives built on HBase, is one such tool. It provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge, tightly integrates with Hadoop for analytics and data processing, and relies on HBase for storage infrastructure.

Published in: Software
  • Be the first to comment

  • Be the first to like this

HBaseCon 2015: Warcbase - Scaling 'Out' and 'Down' HBase for Web Archiving

  1. 1. Warcbase: Scaling ‘Out’, ‘In’, and ‘Down’ HBase for Web Archiving Jimmy Lin University of Maryland @lintool Ian Milligan University of Waterloo @ianmilligan1 HBaseCon2015 Thursday, May 7, 2015
  2. 2. Source: Wikipedia (All Souls College, Oxford) From the Ivory Tower…
  3. 3. Source: Wikipedia (Factory) … to building sh*t that works
  4. 4. Gupta et al.WTF:The Who to Follow Service at Twitter.WWW 2013 Lin and Kolcz. Large-Scale Machine Learning at Twitter. SIGMOD 2012 Busch et al. Earlybird: Real-Time Search at Twitter. ICDE 2012 Mishne et al. Fast Data in the Era of Big Data:Twitter's Real- Time Related Query Suggestion Architecture. SIGMOD 2013. Leibert et al. Automatic Management of Partitioned, Replicated Search Services. SoCC 2011 I worked on… – data products to surface relevant content to users – analytics infrastructure to support data science
  5. 5. Source:
  6. 6. circa ~2010 ~150 people total ~60 Hadoop nodes ~6 people use analytics stack daily circa ~2012 ~1400 people total 10s of Ks of Hadoop nodes, multiple DCs 10s of PBs total Hadoop DW capacity ~100 TB ingest daily dozens of teams use Hadoop daily 10s of Ks of Hadoop jobs daily
  7. 7. Source: Wikipedia (All Souls College, Oxford) And back!
  8. 8. “That sucks.” Big Data for Good Source: “The best minds of my generation are thinking about how to make people click ads.” – Jeff Hammerbacher
  9. 9. Source: Big Data for Good Cultural Heritage Preservation
  10. 10. Source: Wikipedia (Opposition to United States involvement in the Vietnam War) When does an event become “history”?
  11. 11. When does an event become “history”? ~20-30 years later The history of the 1960s were written in the 1980s!
  12. 12. This scandal was brought to you by the digital revolution. That meant we could access all the information we wanted, when we wanted it, anytime, anywhere, and when the story broke in January 1998, it broke online. It was the first time the traditional news was usurped by the Internet for a major news story, a click that reverberated around the world… … it was before social media, but people could still comment online, email stories, and, of course, email cruel jokes. News sources plastered photos of me all over to sell newspapers, banner ads online, and to keep people tuned to the TV. Monica Lewinsky – The Price of Shame TED Talk, March 2015 “ ” So, where are those web pages now?
  13. 13. When does an event become “history”? ~20-30 years later Historians are getting ready to write about the 90s… Right about now!
  14. 14. Can you write a history of the