An overview of traditional spatial analysis tools, an intro to Hadoop and other tools for analyzing terabytes or more of data, and then a primer with examples on combining the two with data pulled from the Twitter streaming API. Given at the O'Reilly Where 2.0 conference in March 2010.
Rainbird: Realtime Analytics at Twitter (Strata 2011) - Kevin Weil
Introducing Rainbird, Twitter's high volume distributed counting service for realtime analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
Hadoop, Pig, and Twitter (NoSQL East 2009) - Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
Apache Hadoop, since its humble beginning as an execution engine for web crawling and building search indexes, has matured into a general-purpose distributed application platform and data store. Large Scale Machine Learning (LSML) techniques and algorithms have proved quite tricky for Hadoop to handle ever since we started offering Hadoop as a service at Yahoo in 2006. In this talk, I will discuss early experiments in implementing LSML algorithms on Hadoop at Yahoo. I will describe how they changed Hadoop and led to the generalization of the Hadoop platform to accommodate programming paradigms other than MapReduce. I will unveil some of our recent efforts to incorporate diverse LSML runtimes into Hadoop, evolving it to become *THE* LSML platform. I will also make a case for an industry-standard LSML benchmark, based on common deep analytics pipelines that utilize LSML workloads.
The Apache Hadoop project and the Hadoop ecosystem have been designed to be extremely flexible and extensible. HDFS, YARN, and MapReduce combined have more than 1,000 configuration parameters that allow users to tune the performance of Hadoop applications and, more importantly, extend Hadoop with application-specific functionality without having to modify any of the core Hadoop code.
In this talk, I will start with simple extensions, such as writing a new InputFormat to efficiently process video files. I will present some extensions that boost application performance, such as optimized compression codecs and pluggable shuffle implementations. With the refactoring of the MapReduce framework and the emergence of YARN as a generic resource manager for Hadoop, one can extend Hadoop further by implementing new computation paradigms.
I will discuss one such computation framework, which allows message passing applications to run in the Hadoop cluster alongside MapReduce. I will conclude by outlining some of our ongoing work that extends HDFS by removing the namespace limitations of the current Namenode implementation.
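To make the first extension point above concrete, here is a minimal, illustrative sketch of a custom InputFormat using the Hadoop Java API that hands each input file to a mapper as one unsplittable record. The class names and the whole-file reading strategy are assumptions for illustration only; they are not the video-processing InputFormat described in the talk, which would likely parse container or frame boundaries instead.

// Illustrative sketch: a custom InputFormat that treats each file as one record.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // one file == one record, so never split it
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }

  static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.split = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) return false;
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = fs.open(file);
      try {
        IOUtils.readFully(in, contents, 0, contents.length);  // read the whole file
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}

A job would simply set this class via job.setInputFormatClass(WholeFileInputFormat.class); nothing in core Hadoop needs to change, which is the point the abstract makes.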
"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable & adaptive run-time environments, and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.
The Zoo Expands: Labrador *Loves* Elephant, Thanks to Hamster - Milind Bhandarkar
The refactoring of the Hadoop MapReduce framework, separating resource management (YARN) from job execution (MapReduce), has allowed multiple programming paradigms to take advantage of massive-scale Hadoop Distributed File System (HDFS) clusters. Hamster (Hadoop And Mpi on the same cluSTER) is a port of OpenMPI that uses YARN as its resource manager. Hamster allows applications written using MPI (Message Passing Interface) to run alongside other YARN applications and frameworks, such as MapReduce, on the same Hadoop cluster. In this talk, I will describe the architecture of Hamster and present a few MPI applications that have been demonstrated to run in Hadoop. GraphLab uses MPI as one of its supported communication libraries and can read/write data from/to HDFS. I will describe how GraphLab runs on top of Hadoop using Hamster, and present a few benchmarks in graph analytics comparing GraphLab with other machine learning frameworks.
Tweet alert - semantic analysis in social networks for citizen opinion mining - Sngular Meaning
Description of a configurable, real-time system for the automatic recording, analysis, and visualization of information from user interactions on Twitter. The system is designed to provide public bodies (government agencies) with a powerful tool to rapidly and easily understand citizen behavior trends and opinions about city services, events, etc., and it may also be used as a primary alert system to improve the efficiency of emergency systems. The citizen is observed here as a proactive city sensor capable of generating huge amounts of very rich, high-level, and valuable data through social media platforms, which, once properly processed, summarized, and annotated, allow city officers to better understand citizen needs. The architecture and component blocks are described, and some key details of the design, implementation, and application scenarios are discussed. Textalytics APIs are used for the semantic analysis of relevant tweets.
Presentation by DAEDALUS, UPM and UC3M at PEGOV 2014, 2nd International Workshop on Personalization in eGovernment Services and Applications, Aalborg, Denmark, in conjunction with the 22nd Conference on User Modeling, Adaptation and Personalization - UMAP 2014.
Measuring CDN performance and why you're doing it wrong - Fastly
Integrating content delivery networks into your application infrastructure can offer many benefits, including major performance improvements for your applications. So understanding how CDNs perform, especially for your specific use cases, is vital. However, measuring CDN performance is complicated and nuanced, and often results in metric overload and confusion. It's becoming increasingly important to understand measurement techniques, what they're telling you, and how to apply them to your actual content.
In this session, we'll examine the challenges around measuring CDN performance and focus on the different methods for measurement. We'll discuss what to measure, important metrics to focus on, and different ways that numbers may mislead you.
More specifically, we'll cover:
Different techniques for measuring CDN performance
Differentiating between network footprint and object delivery performance
Choosing the right content to test
Core metrics to focus on and how each impacts real traffic
Understanding cache hit ratio, why it can be misleading, and how to measure for it
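To illustrate the cache hit ratio point above (this sketch is not from the talk, and the traffic numbers are hypothetical), a small Java example contrasting a request-based hit ratio with a byte-based hit ratio for the same traffic: a handful of small cached objects plus one large uncached object can yield a flattering request hit ratio while almost every byte still comes from origin.

import java.util.List;

public class HitRatioExample {
  record Request(boolean cacheHit, long bytes) {}

  public static void main(String[] args) {
    List<Request> log = List.of(
        new Request(true, 10_000),        // small cached assets
        new Request(true, 12_000),
        new Request(true, 8_000),
        new Request(true, 15_000),
        new Request(false, 500_000_000)   // one large object served from origin
    );

    long hits = log.stream().filter(Request::cacheHit).count();
    long hitBytes = log.stream().filter(Request::cacheHit).mapToLong(Request::bytes).sum();
    long totalBytes = log.stream().mapToLong(Request::bytes).sum();

    System.out.printf("Request hit ratio: %.0f%%%n", 100.0 * hits / log.size());      // 80%
    System.out.printf("Byte hit ratio:    %.2f%%%n", 100.0 * hitBytes / totalBytes);  // ~0.01%
  }
}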
Demo or Die: Where advertising meets product design - Christine Outram
This presentation explores the role of rapid prototyping in the age of digital advertising and how it is transforming a "traditional creative process" into a lean, interactive, and multidisciplinary endeavor. Advertising is evolving; the best ads are not always ads; demo or die.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the second half of the tutorial.
Social media data for social science research - Davide Bennato
This is the talk I gave at the Lipari Summer School on Computational Social Science 2013. What is the relationship between social science and big data? The talk focuses on Twitter and its social media mining tools.
http://www.tecnoetica.it/2013/08/07/lipari-summer-school-computational-social-science-big-data-e-twitter/
With the tremendous growth of social networks, there has been a corresponding growth in the amount of new data created every minute on these sites. The notion of community in this social networking world has attracted a lot of attention. Studying Twitter is useful for understanding how people use new communication technologies to form social connections and maintain existing ones. We analysed how geo-tagged tweets on Twitter can be used to identify useful user features and behavior, as well as landmarks/places of interest. We also analysed several clustering algorithms and proposed different similarity measures to detect communities.
O'Reilly Where 2.0 2011
As a result of cheap storage and computing power, society is measuring and storing increasing amounts of information.
It is now possible to efficiently crunch Petabytes of data with tools like Hadoop.
In this O'Reilly Where 2.0 tutorial, Pete Skomoroch, Sr. Data Scientist at LinkedIn, gives an overview of spatial analytics and how you can use tools like Hadoop, Python, and Mechanical Turk to process location data and derive insights about cities and people.
Topics:
* Data Science & Geo Analytics
* Useful Geo tools and Datasets
* Hadoop, Pig, and Big Data
* Cleaning Location Data with Mechanical Turk
* Spatial Tweet Analytics with Hadoop & Python
* Using Social Data to Understand Cities
Big Data Analytics with Hadoop with @techmilind - EMC
Hadoop has rapidly emerged as the preferred solution for big data analytics across unstructured data, and companies are seeking competitive advantage by finding effective ways of analyzing new sources of unstructured and machine-generated data. This session reviews the practices of performing analytics on unstructured data with Hadoop.
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015 - Codemotion
Codemotion Rome 2015 - Big Data is undoubtedly one of the "hottest" topics in today's technology landscape. To date, roughly 5 exabytes of data have been produced worldwide, a potential source of "intelligence" that, thanks to the latest technologies, can be exploited in fields ranging from medicine to sociology to marketing. Through a virtual trip into space, this talk introduces the concepts, techniques, and tools needed to start exploiting the potential of Big Data in everyday work.
Supporting material for the hadoop-example GitHub repository. You can run the code on your local computer to understand how its MapReduce examples work.
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q... - Databricks
In real-life applications, we often deal with situations where analysis needs to be conducted on graphs where the nodes and edges are associated with multiple labels. For example, in a graph that represents user activities in social networks, the labels associated with nodes may indicate their membership in communities (e.g. group, school, company, etc.), and the labels associated with edges may denote types of activities (e.g. comment, like, share, etc.). The current GraphX library in Spark does not directly support efficient analysis and computation on label-defined subgraphs.
In this session, the speakers will propose a general API library that is able to support analysis on multi-label graphs, and can be reused and extended to design more complicated algorithms. It includes a method to create multi-label graphs and calculate basic statistics and metrics at both the global and subgraph level. Common graph algorithms, such as PageRank, can also be efficiently implemented in a parallel scheme by reusing modules and algorithms in GraphX, such as the Pregel API.
See how LinkedIn is able to leverage this tool to efficiently find top LinkedIn feed influencers in different communities and by different actions.
6. Data, Data Everywhere
• You guys generate a lot of data
• Anybody want to guess?
• 7 TB/day (2+ PB/yr)
• 10,000 CDs
• 5 million floppy disks
• 225 GB while I give this talk
11. Syslog?
• Started with syslog-ng
• As our volume grew, it didn't scale
• Resources overwhelmed
• Lost data
12. Scribe
• Surprise! FB had same problem, built and open-sourced Scribe
• Log collection framework over Thrift
• You write log lines, with categories
• It does the rest
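As a rough illustration of the two bullets above, here is a hedged sketch of what writing a categorized log line to Scribe could look like from a Java application. It assumes client classes generated from Scribe's Thrift IDL (LogEntry and scribe.Client), a local Scribe agent on the conventional port 1463, and a pre-0.13 Thrift Java library; none of this is code from the talk.

import java.util.Arrays;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
// LogEntry and scribe.Client are assumed to be generated from scribe.thrift.

public class ScribeLogSketch {
  public static void main(String[] args) throws Exception {
    TSocket socket = new TSocket("localhost", 1463);        // local scribe agent
    TFramedTransport transport = new TFramedTransport(socket);
    TBinaryProtocol protocol = new TBinaryProtocol(transport);
    scribe.Client client = new scribe.Client(protocol);

    transport.open();
    // Each line carries a category; routing downstream is Scribe configuration.
    client.Log(Arrays.asList(new LogEntry("web_requests", "GET /timeline 200 12ms\n")));
    transport.close();
  }
}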
13. Scribe
(diagram: FE nodes → Agg nodes → File / HDFS)
• Runs locally; reliable in network outage
• Nodes only know downstream writer; hierarchical, scalable
• Pluggable outputs
16. Scribe at Twitter
• Solved our problem, opened new vistas
• Currently 30 different categories logged from javascript, RoR, Scala, etc
• We improved logging, monitoring, writing to Hadoop, compression
17. Scribe at Twitter
• Continuing to work with FB
• GSoC project! Help make it more awesome.
• http://github.com/traviscrawford/scribe
• http://wiki.developers.facebook.com/index.php/User:GSoC
19. How do you store 7TB/day?
• Single machine?
• What's HD write speed?
• 80 MB/s
• 24.3 hrs to write 7 TB
• Uh oh.
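A quick check of the arithmetic behind this slide (assuming decimal units): 7 TB/day = 7,000,000 MB, and 7,000,000 MB ÷ 80 MB/s = 87,500 s ≈ 24.3 hours, so a single disk would spend essentially the entire day just writing the data.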
23. Where do I put 7TB/day?
• Need a cluster of machines
• ... which adds new layers of complexity
25. Hadoop
• Distributed file system
• Automatic replication, fault tolerance
• MapReduce-based parallel computation
• Key-value based computation interface allows for wide applicability
26. Hadoop
• Open source: top-level Apache project
• Scalable: Y! has a 4000 node cluster
• Powerful: sorted 1TB random integers in 62 seconds
• Easy packaging: free Cloudera RPMs
27. MapReduce Workflow
(diagram: Inputs → Map → Shuffle/Sort → Reduce → Outputs)
• Challenge: how many tweets per user, given tweets table?
• Input: key=row, value=tweet info
• Map: output key=user_id, value=1
• Shuffle: sort by user_id
• Reduce: for each user_id, sum
• Output: user_id, tweet count
• With 2x machines, runs 2x faster
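A minimal sketch of the job this slide describes, written against the Hadoop Java MapReduce API. The assumption that the user id is the first tab-separated field of each tweet record, and all class names, are illustrative only, not Twitter's actual code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetsPerUser {

  // Map: emit (user_id, 1) for every tweet record.
  public static class TweetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text userId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumption for illustration: user_id is the first tab-separated field.
      String[] fields = value.toString().split("\t");
      userId.set(fields[0]);
      context.write(userId, ONE);
    }
  }

  // Reduce: the shuffle groups by user_id; summing the ones gives the tweet count.
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text userId, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable c : counts) {
        total += c.get();
      }
      context.write(userId, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "tweets per user");
    job.setJarByClass(TweetsPerUser.class);
    job.setMapperClass(TweetMapper.class);
    job.setCombinerClass(SumReducer.class);   // partial sums on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as a combiner is what lets the "2x machines, 2x faster" scaling hold up in practice: most of the counting happens locally before the shuffle.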
34. Two Analysis Challenges
1. Compute friendships in Twitter's social graph
• grep, awk? No way.
• Data is in MySQL... self join on an n-billion row table?
• n,000,000,000 x n,000,000,000 = ?
• I don't know either.
36. Two Analysis Challenges
2. Large-scale grouping and counting
• select count(*) from users? maybe.
• select count(*) from tweets? uh...
• Imagine joining them.
• And grouping.
• And sorting.
37. Back to Hadoop
• Didn't we have a cluster of machines?
• Hadoop makes it easy to distribute the calculation
• Purpose-built for parallel calculation
• Just a slight mindset adjustment
• But a fun one!
39. Analysis at Scale
• Now we're rolling
• Count all tweets: 12 billion, 5 minutes
• Hit FlockDB in parallel to assemble social graph aggregates
• Run pagerank across users to calculate reputations
40. But...
• Analysis typically in Java
• Single-input, two-stage data flow is rigid
• Projections, filters: custom code
• Joins lengthy, error-prone
• n-stage jobs: hard to manage
• Exploration requires compilation
47. Pig Makes it Easy
• 5% of the code
• 5% of the dev time
• Within 25% of the running time
• Readable, reusable
50. One Thing I’ve Learned
• It’s easy to answer questions.
• It’s hard to ask the right questions.
• Value the system that promotes innovation and iteration
• More minds contributing = more value from your data
54. Counting Big Data
• How many requests per day?
• Average latency? 95% latency?
• Response code distribution per hour?
• Searches per day?
• Unique users searching, unique queries?
• Geographic distribution of queries?
60. Correlating Big Data
• Usage difference for mobile users?
• ... for users on desktop clients?
• Cohort analyses
• What features get users hooked?
• What do successful users use often?
64. Research on Big Data
• What can we tell from a user’s tweets?
• ... from the tweets of their followers?
• ... from the tweets of those they follow?
• What influences retweet tree depth?
• Duplicate detection, language detection
• Machine learning
70. If We Had More Time...
• HBase backing namesearch
• LZO compression
• Protocol Buffers and Hadoop
• Our open source: hadoop-lzo, elephant-bird
• Realtime analytics with Cassandra