Kevin Weil presented on Hadoop at Twitter. He discussed Twitter's data lifecycle, including data input via Scribe and Crane, storage in HDFS and HBase, analysis using Pig and Oink, and data products like Birdbrain. He described how tools like Scribe, Crane, Elephant Bird, Pig, and HBase were developed and used at Twitter to handle large volumes of log and tabular data at petabyte scale.
Hadoop, Pig, and Twitter (NoSQL East 2009) by Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
Rainbird: Realtime Analytics at Twitter (Strata 2011) by Kevin Weil
Introducing Rainbird, Twitter's high-volume distributed counting service for realtime analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
Silicon Valley Cloud Computing Meetup
Mountain View, 2010-07-19
Examples of Hadoop Streaming, based on Python scripts running on the AWS Elastic MapReduce service, showing text mining on the "Enron Email Dataset" from Infochimps.com plus data visualization using R and Gephi.
Source at: http://github.com/ceteri/ceteri-mapred
Introduction to data processing using Hadoop and Pig by Ricardo Varela
In this talk we give an introduction to big data processing and review the basic concepts of MapReduce programming with Hadoop. We also discuss the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
At Spotify we collect huge volumes of data for many purposes. Reporting to labels, powering our product features, and analyzing user growth are some of our most common ones. Additionally, we collect many operational metrics related to the responsiveness, utilization and capacity of our servers. To store and process this data, we use a scalable and fault-tolerant multi-system infrastructure, and Apache Hadoop is a key part of it. Surprisingly or not, Apache Hadoop generates large amounts of data in the form of logs and metrics that describe its behaviour and performance. To process this data in a scalable and performant manner we use … also Hadoop! During this presentation, I will talk about how we analyze various logs generated by Apache Hadoop using custom scripts (written in Pig or Java/Python MapReduce) and available open-source tools to get data-driven answers to many questions related to the behaviour of our 690-node Hadoop cluster. At Spotify we frequently leverage these tools to learn how fast we are growing, decide when to buy new nodes, calculate the empirical retention policy for each dataset, optimize the scheduler, benchmark the cluster, find its biggest offenders (both people and datasets), and more.
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t... by Brian O'Neill
This presentation covers our use of Storm and the connectors we've built. It also proposes a design for integrating Storm with real-time web services by embedding parts of topologies directly into the web services layer.
OCF.tw's talk about "Introduction to spark" by Giivee The
A talk introducing Spark, shared at the invitation of OCF and OSSF.
If you are interested in the Open Culture Foundation (OCF) or the Open Source Software Foundry (OSSF), please check http://ocf.tw/ or http://www.openfoundry.org/
Thanks also to CLBC for providing the venue.
If you would like to work in a great working environment, you are welcome to contact CLBC: http://clbc.tw/
Apache Spark: killer or savior of Apache Hadoop? by rhatr
The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase by HBaseCon
As the operator of the dominant messenger application in South Korea, KakaoTalk has more than 170 million users, and our ever-growing graph has more than 10B edges and 200M vertices. This scale presents several technical challenges for storing and querying the graph data, but we have resolved them by creating a new distributed graph database with HBase. Here you'll learn the methodology and architecture we used to solve the problems, compare it to another well-known graph database, Titan, and explore the HBase issues we encountered.
Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches by Mithun Radhakrishnan
Here's the talk that we presented at Hadoop Summit 2015 in San Jose. This was an inside look at how we at Yahoo scaled Hive to work at Yahoo's data/metadata scale.
The Elephant in the Cloud: A Quest for the Next Generation
In this talk, I will go through the evolution of Hadoop and its ecosystem projects and will try to peer into the crystal ball to predict what may be coming down the pike. I will discuss various ways of crunching data on Hadoop (MapReduce, OpenMPI, Spark and various SQL engines) and how these tools complement each other.
Apache Hadoop is no longer just a faithful, open source, scalable implementation of two seminal papers that came out of Google 10 years ago. It has evolved into a project that provides enterprises with a reliable layer for storing massive amounts of unstructured data (HDFS) while allowing different computational frameworks to leverage those datasets.
The original computational framework (MapReduce) has evolved into a much more scalable set of general-purpose cluster management APIs collectively known as YARN. With YARN underneath, MapReduce is still there to support batch-oriented computations, but it is no longer the only game in town. With OpenMPI, Spark, and Tez rapidly becoming available, now is truly the most exciting time to be a developer in the Hadoop ecosystem. It is also a time when you don't have to be employed by Yahoo!, Facebook or eBay to have access to mind-blowing compute power. That power is a credit card and a pivotal.io account away for anybody on the planet.
I will conclude by outlining some of the ongoing work that makes Hadoop and its ecosystem projects first class citizens in cloud environments based on the work that Pivotal engineers have done with integrating Hadoop into PivotalONE PaaS.
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013) by VMware Tanzu
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at: http://www.gopivotal.com/big-data/pivotal-hd
Resilience: the key requirement of a [big] [data] architecture - StampedeCon... by StampedeCon
From the StampedeCon 2015 Big Data Conference: There is an adage, “If you fail to plan, you plan to fail”. When developing systems, the adage can be taken a step further: “If you fail to plan FOR FAILURE, you plan to fail”. At Huffington Post, data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some material will focus on understanding the traits of specific distributed systems, such as message queues or NoSQL databases, and the consequences of different types of failures. Other parts of the presentation will focus on how systems and software can be designed to make re-processing batch data simple, and how to determine which failure-mode semantics are important for a real-time event processing system.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
The Apache Hadoop project and the Hadoop ecosystem have been designed to be extremely flexible and extensible. HDFS, YARN, and MapReduce combined have more than 1,000 configuration parameters that allow users to tune the performance of Hadoop applications and, more importantly, extend Hadoop with application-specific functionality without having to modify any of the core Hadoop code.
In this talk, I will start with simple extensions, such as writing a new InputFormat to efficiently process video files. I will then present some extensions that boost application performance, such as optimized compression codecs and pluggable shuffle implementations. With the refactoring of the MapReduce framework and the emergence of YARN as a generic resource manager for Hadoop, one can extend Hadoop further by implementing new computation paradigms.
I will discuss one such computation framework, which allows Message Passing applications to run in the Hadoop cluster alongside MapReduce. I will conclude by outlining some of our ongoing work that extends HDFS by removing the namespace limitations of the current Namenode implementation.
Apache Hadoop, since its humble beginning as an execution engine for web crawling and building search indexes, has matured into a general-purpose distributed application platform and data store. Large Scale Machine Learning (LSML) techniques and algorithms have proven quite tricky for Hadoop to handle ever since we started offering Hadoop as a service at Yahoo in 2006. In this talk, I will discuss early experiments of implementing LSML algorithms on Hadoop at Yahoo. I will describe how it changed Hadoop, and led to generalization of the Hadoop platform to accommodate programming paradigms other than MapReduce. I will unveil some of our recent efforts to incorporate diverse LSML runtimes into Hadoop, evolving it to become *THE* LSML platform. I will also make a case for an industry-standard LSML benchmark, based on common deep analytics pipelines that utilize LSML workloads.
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster by Milind Bhandarkar
The refactoring of the Hadoop MapReduce framework, by separating resource management (YARN) from job execution (MapReduce), has allowed multiple programming paradigms to take advantage of massive-scale Hadoop Distributed File System (HDFS) clusters. Hamster (Hadoop And Mpi on the same cluSTER) is a port of OpenMPI to use YARN as a resource manager. Hamster allows applications written using MPI (Message Passing Interface) to run alongside other YARN applications and frameworks, such as MapReduce, on the same Hadoop cluster. In this talk, I will describe the architecture of Hamster, and present a few MPI applications that have been demonstrated to run in Hadoop. GraphLab uses MPI as one of the supported communication libraries, and can read/write data from/to HDFS. I will describe how GraphLab runs on top of Hadoop using Hamster, and present a few benchmarks in graph analytics, comparing GraphLab with other machine learning frameworks.
Measuring CDN performance and why you're doing it wrong by Fastly
Integrating content delivery networks into your application infrastructure can offer many benefits, including major performance improvements for your applications. So understanding how CDNs perform — especially for your specific use cases — is vital. However, testing for measurement is complicated and nuanced, and results in metric overload and confusion. It's becoming increasingly important to understand measurement techniques, what they're telling you, and how to apply them to your actual content.
In this session, we'll examine the challenges around measuring CDN performance and focus on the different methods for measurement. We'll discuss what to measure, important metrics to focus on, and different ways that numbers may mislead you.
More specifically, we'll cover:
Different techniques for measuring CDN performance
Differentiating between network footprint and object delivery performance
Choosing the right content to test
Core metrics to focus on and how each impacts real traffic
Understanding cache hit ratio, why it can be misleading, and how to measure for it
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the second half of the tutorial.
An overview of traditional spatial analysis tools, an intro to Hadoop and other tools for analyzing terabytes or more of data, and then a primer with examples on combining the two with data pulled from the Twitter streaming API. Given at the O'Reilly Where 2.0 conference in March 2010.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1FQYcP0.
Gian Merlino presents the advantages, challenges, and best practices to deploying and maintaining lambda architectures in the real world, using the infrastructure at Metamarkets as a case study. Filmed at qconsf.com.
Gian Merlino is a senior software engineer at Metamarkets, responsible for the infrastructure behind its data ingestion pipelines and is a committer on the Druid project.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
Big Data at Riot Games – Using Hadoop to Understand Player Experience - Stamp... by StampedeCon
At the StampedeCon 2013 Big Data conference in St. Louis, Riot Games discussed Using Hadoop to Understand and Improve Player Experience. Riot Games aims to be the most player-focused game company in the world. To fulfill that mission, it’s vital we develop a deep, detailed understanding of players’ experiences. This is particularly challenging since our debut title, League of Legends, is one of the most played video games in the world, with more than 32 million active monthly players across the globe. In this presentation, we’ll discuss several use cases where we sought to understand and improve the player experience, the challenges we faced to solve those use cases, and the big data infrastructure that supports our capability to provide continued insight.
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! by Sumeet Singh
Yahoo! has been using HBase for a long time in isolated instances, most notably for the personalization platform powering its homepage experiences. The introduction of multi-tenancy has lowered the barriers for all Hadoop users to use HBase. We will cover traditional use cases for HBase at Yahoo!, and new use cases as a result in content management, advertising, log processing, analytics and reporting, recommendation graphs, and dimension data stores.
We will then talk about the deployment strategy and enhancements made that facilitate multi-tenancy. Region Server groups provide a coarse level of isolation among tenants by designating a subset of region servers to serve designated tables, while Namespaces provide logical grouping of resources (region servers, tables) and privileges (quotas, ACLs).
We'll also share our experiences in operating HBase with security enabled and contributions made in this area, and results from performance runs conducted to validate customer expectations in a multi-tenant environment.
URL: http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--multi-tenant-apache-hbase-at-yahoo-video.html
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012 by larsgeorge
In the early days of web applications, sites were designed to serve users and gather information along the way. With the proliferation of data sources and growing user bases, the amount of data generated required new ways for storage and processing. Hadoop's HDFS and its batch oriented MapReduce opened new possibilities, yet it falls short of instant delivery of aggregate data to end users. Adding HBase and other layers, such as stream processing using Twitter's Storm, can overcome this delay and bridge the gap to realtime aggregation and reporting. This presentation takes the audience from the beginning of web application design to the current architecture, which combines multiple technologies to be able to process vast amounts of data, while still being able to react timely and report near realtime statistics.
http://berlinbuzzwords.de/sessions/batch-realtime-hadoop
From Eric Baldeschwieler's presentation "Hadoop @ Yahoo! - Internet Scale Data Processing" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/509
Philly Code Camp 2013: Big Data with SQL Server by Mark Kromer
These are my slides from May 2013 Philly Code Camp at Penn State Abington. I will post the samples, code and scripts on my blog here following the event this Saturday: http://www.kromerbigdata.com
Unified Batch and Real-Time Stream Processing Using Apache Flink by Slim Baltagi
This talk was given at Capital One on September 15, 2015 at the launch of the Washington DC Area Apache Flink Meetup. Apache Flink is positioned at the forefront of two major trends in Big Data Analytics:
- Unification of Batch and Stream processing
- Multi-purpose Big Data Analytics frameworks
In these slides, you will also find answers to the burning question: Why Apache Flink? You will also learn more about how Apache Flink compares to Hadoop MapReduce, Apache Spark and Apache Storm.
2. The Twitter Data Lifecycle
‣ Data Input
‣ Data Storage
‣ Data Analysis
‣ Data Products
3. The Twitter Data Lifecycle
‣ Data Input: Scribe, Crane
‣ Data Storage: Elephant Bird, HBase
‣ Data Analysis: Pig, Oink
‣ Data Products: Birdbrain
1 Community Open Source
2 Twitter Open Source (or soon)
4. My Background
‣ Studied Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, (soon) PBs of data
5. The Twitter Data Lifecycle
‣ Data Input: Scribe, Crane
‣ Data Storage
‣ Data Analysis
‣ Data Products
1 Community Open Source
2 Twitter Open Source
6. What Data?
‣ Two main kinds of raw data
‣ Logs
‣ Tabular data
7. Logs
‣ Started with syslog-ng
‣ As our volume grew, it didn’t scale
8. Logs
‣ Started with syslog-ng
‣ As our volume grew, it didn’t scale
‣ Resources overwhelmed
‣ Lost data
9. Scribe
‣ Scribe daemon runs locally; reliable in network outage
‣ Nodes only know downstream writer; hierarchical, scalable
‣ Pluggable outputs, per category
[Diagram: FE nodes feed Agg (aggregator) nodes, which write to File and HDFS outputs]
10. Scribe at Twitter
‣ Solved our problem, opened new vistas
‣ Currently 57 different categories logged from multiple sources
‣ FE: Javascript, Ruby on Rails
‣ Middle tier: Ruby on Rails, Scala
‣ Backend: Scala, Java, C++
‣ 7 TB/day into HDFS
‣ Log first, ask questions later.
11. Scribe at Twitter
‣ We’ve contributed to it as we’ve used it1
‣ Improved logging, monitoring, writing to HDFS, compression
‣ Added ZooKeeper-based config
‣ Continuing to work with FB on patches
‣ Also: working with Cloudera to evaluate Flume
1 http://github.com/traviscrawford/scribe
12. Tabular Data
‣ Most site data is in MySQL
‣ Tweets, users, devices, client applications, etc
‣ Need to move it between MySQL and HDFS
‣ Also between MySQL and HBase, or MySQL and MySQL
‣ Crane: configuration driven ETL tool
17. Crane
‣ Extract
‣ MySQL, HDFS, HBase, Flock, GA, Facebook Insights
‣ Transform
‣ IP/Phone -> Geo, canonicalize dates, cleaning, arbitrary logic
‣ Load
‣ MySQL, Local file, Stdout, HDFS, HBase
‣ ZooKeeper coordination, intelligent date management
‣ Run all the time from multiple servers, self healing
18. The Twitter Data Lifecycle
‣ Data Input
‣ Data Storage: Elephant Bird, HBase
‣ Data Analysis
‣ Data Products
1 Community Open Source
2 Twitter Open Source
19. Storage Basics
‣ Incoming data: 7 TB/day
‣ LZO encode everything
‣ Save 3-4x on storage, pay little CPU
‣ Splittable!1
‣ IO-bound jobs ==> 3-4x perf increase
1 http://www.github.com/kevinweil/hadoop-lzo
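[Not from the deck: with the hadoop-lzo codec from the footnote above installed, a Pig job can be told to LZO-compress its output through standard job properties. A hedged sketch:]
-- assumes hadoop-lzo (footnote 1) is deployed on the cluster
SET output.compression.enabled true;
SET output.compression.codec com.hadoop.compression.lzo.LzopCodec;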
21. Elephant Bird
‣ We have data coming in as protocol buffers via Crane...
22. Elephant Bird
‣ We have data coming in as protocol buffers via Crane...
‣ Protobufs: codegen for efficient ser-de of data structures
23. Elephant Bird
‣ We have data coming in as protocol buffers via Crane...
‣ Protobufs: codegen for efficient ser-de of data structures
‣ Why shouldn’t we just continue, and codegen more glue?
24. Elephant Bird
‣ We have data coming in as protocol buffers via Crane...
‣ Protobufs: codegen for efficient ser-de of data structures
‣ Why shouldn’t we just continue, and codegen more glue?
‣ InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders
25. Elephant Bird
‣ We have data coming in as protocol buffers via Crane...
‣ Protobufs: codegen for efficient ser-de of data structures
‣ Why shouldn’t we just continue, and codegen more glue?
‣ InputFormats, OutputFormats, Pig LoadFuncs, Pig StoreFuncs, Pig HBaseLoaders
‣ Also now does part of this with Thrift, soon Avro
‣ And JSON, W3C Logs
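[Not from the deck: a hedged sketch of consuming that protobuf data from Pig with an Elephant Bird loader. Loader class names and signatures varied across Elephant Bird versions, and the jar, path, and message name here are hypothetical:]
-- register the Elephant Bird jar, then load LZO-compressed,
-- base64-encoded protobufs as Pig tuples
REGISTER 'elephant-bird.jar';
statuses = LOAD '/tables/statuses'
           USING com.twitter.elephantbird.pig.load.LzoProtobufB64LinePigLoader('Status');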
26. Challenge: Mutable Data
‣ HDFS is write-once: no seek on write, no append (yet)
‣ Logs are easy.
‣ But our tables change.
27. Challenge: Mutable Data
‣ HDFS is write-once: no seek on write, no append (yet)
‣ Logs are easy.
‣ But our tables change.
‣ Handling rapidly changing data in HDFS: not trivial.
‣ Don’t worry about updated data
‣ Refresh entire dataset
‣ Download updates, tombstone old versions of data, ensure jobs only run over current versions of data, occasionally rewrite full dataset
28. Challenge: Mutable Data
‣ HDFS is write-once: no seek on write, no append (yet)
‣ Logs are easy.
‣ But our tables change.
‣ Handling changing data in HDFS: not trivial.
29. HBase
‣ Has already solved the update problem
‣ Bonus: low-latency query API
‣ Bonus: rich, BigTable-style data model based on column families
30. HBase at Twitter
‣ Crane loads data directly into HBase
‣ One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access
‣ Processing updates transparent, so we always have accurate data in HBase
‣ Pig Loader for HBase in Elephant Bird makes integration with existing analyses easy
31. HBase at Twitter
‣ Crane loads data directly into HBase
‣ One CF for protobuf bytes, one CF to denormalize columns for indexing or quicker batch access
‣ Processing updates transparent, so we always have accurate data in HBase
‣ Pig Loader for HBase in Elephant Bird
32. The Twitter Data Lifecycle
‣ Data Input
‣ Data Storage
‣ Data Analysis: Pig, Oink
‣ Data Products
1 Community Open Source
2 Twitter Open Source
33. Enter Pig
‣ High level language
‣ Transformations on sets of records
‣ Process data one step at a time
‣ UDFs are first-class citizens
‣ Easier than SQL?
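[Not from the deck: a sketch of "UDFs are first-class citizens". The jar, path, and UDF class are hypothetical; the point is that a custom function is called inline like any built-in:]
REGISTER 'my-udfs.jar';
tweets   = LOAD '/tables/statuses' AS (user_id:long, text:chararray);
hashtags = FOREACH tweets GENERATE user_id, com.example.ExtractHashtags(text);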
34. Why Pig?
‣ Because I bet you can read the following script.
35. A Real Pig Script
‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.
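[The script slides are images and are not captured in this transcript. A sketch in the same spirit, based on the canonical users/page-views example (file names and schemas hypothetical): load two datasets, filter, join, group, count, and take the top five, in a handful of readable lines.]
users     = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int);
young     = FILTER users BY age >= 18 AND age <= 25;
pages     = LOAD 'pages.csv' USING PigStorage(',') AS (name:chararray, url:chararray);
joined    = JOIN young BY name, pages BY name;
grouped   = GROUP joined BY url;
counted   = FOREACH grouped GENERATE group AS url, COUNT(joined) AS views;
sorted    = ORDER counted BY views DESC;
top_sites = LIMIT sorted 5;
STORE top_sites INTO 'top_sites';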
37. Pig Democratizes Large-scale Data Analysis
‣ The Pig version is:
‣ 5% of the code
‣ 5% of the time
‣ Within 30% of the execution time.
‣ Innovation increasingly driven from large-scale data analysis
‣ Need fast iteration to understand the right questions
‣ More minds contributing = more value from your data
38. Pig Examples
‣ Using the HBase Loader
‣ Using the protobuf loaders
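[The example slides are images and are not captured here. As an approximation of the HBase loader usage, here is Pig's built-in HBaseStorage, which follows the same pattern; the Elephant Bird loader signatures varied by version, and the table and columns below are hypothetical:]
users = LOAD 'hbase://users'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('profile:name profile:followers')
        AS (name:chararray, followers:long);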
39. Pig Workflow
‣ Oink: framework around Pig for loading, combining, running, post-processing
‣ Everyone I know has one of these
‣ Points to an opening for innovation; discussion beginning
‣ Something we’re looking at: Ruby DSL for Pig, Piglet1
1 http://github.com/ningliang/piglet
40. Counting Big Data
‣ standard counts, min, max, std dev
‣ How many requests do we serve in a day?
‣ What is the average latency? 95% latency?
‣ Group by response code. What is the hourly distribution?
‣ How many searches happen each day on Twitter?
‣ How many unique queries, how many unique users?
‣ What is their geographic distribution?
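[Not from the deck: a hedged sketch of a counting question in Pig, e.g. request counts and latency statistics per hour and response code (log schema hypothetical):]
logs   = LOAD '/logs/web' AS (hour:int, code:int, latency:double);
by_key = GROUP logs BY (hour, code);
stats  = FOREACH by_key GENERATE
           FLATTEN(group)    AS (hour, code),
           COUNT(logs)       AS requests,
           AVG(logs.latency) AS avg_latency,
           MAX(logs.latency) AS max_latency;
DUMP stats;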
41. Correlating Big Data
‣ probabilities, covariance, influence
‣ How does usage differ for mobile users?
‣ How about for users with 3rd party desktop clients?
‣ Cohort analyses
‣ Site problems: what goes wrong at the same time?
‣ Which features get users hooked?
‣ Which features do successful users use often?
‣ Search corrections, search suggestions
‣ A/B testing
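[Not from the deck: a hedged sketch of a correlation-style question, e.g. how usage differs for mobile users; schemas, paths, and the 'source' convention are hypothetical:]
tweets  = LOAD '/tables/statuses' AS (user_id:long, source:chararray);
users   = LOAD '/tables/users' AS (user_id:long, cohort:chararray);
joined  = JOIN tweets BY user_id, users BY user_id;
flags   = FOREACH joined GENERATE users::cohort AS cohort,
            (tweets::source == 'mobile' ? 1 : 0) AS is_mobile;
grouped = GROUP flags BY cohort;
rates   = FOREACH grouped GENERATE group AS cohort, AVG(flags.is_mobile) AS mobile_share;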
42. Research on Big Data
‣ prediction, graph analysis, natural language
‣ What can we tell about a user from their tweets?
‣ From the tweets of those they follow?
‣ From the tweets of their followers?
‣ From the ratio of followers/following?
‣ What graph structures lead to successful networks?
‣ User reputation
43. Research on Big Data
‣ prediction, graph analysis, natural language
‣ Sentiment analysis
‣ What features get a tweet retweeted?
‣ How deep is the corresponding retweet tree?
‣ Long-term duplicate detection
‣ Machine learning
‣ Language detection
‣ ... the list goes on.
44. The Twitter Data Lifecycle
‣ Data Input
‣ Data Storage
‣ Data Analysis
‣ Data Products: Birdbrain
1 Community Open Source
2 Twitter Open Source
45. Data Products
‣ Ad Hoc Analyses
‣ Answer questions to keep the business agile, do research
‣ Online Products
‣ Name search, other upcoming products
‣ Company Dashboard
‣ Birdbrain
46. Questions? Follow me at twitter.com/kevinweil
‣ P.S. We’re hiring. Help us build the next step: realtime big data analytics.