Data infrastructure at Facebook, with reference to the conference paper "Data warehousing and analytics infrastructure at Facebook"
Data warehouse
Hadoop - Hive - Scribe
The use of the latest internet technologies has resulted in large volumes of data, and one of the main challenges is storing and processing that data. The techniques used to manage this massive amount of data and to extract value from it are collectively called Big Data. Over recent years there has been rising interest in big data for social media analysis. Online social media have become important platforms across the world for sharing information. Facebook, one of the largest social media sites, receives millions of posts every day. One of the efficient technologies that deals with Big Data is Hadoop, which uses the MapReduce programming model for processing large data volumes. This presentation provides a survey of Hadoop and its role at Facebook, and a brief introduction to Hive.
Data infrastructure at Facebook
1. Republic of Tunisia
Ministry of Higher Education and Scientific Research
University of Monastir
Faculty of Sciences of Monastir
Data Infrastructure at Facebook
Elaborated by: Doukh Ahmed
SRA 2
2019 - 2020
2. WHAT WE WILL DO
Introduction
Facebook and Big Data
Storage systems at Facebook
Data Warehousing at Facebook
Conclusion
4. Introduction
If Facebook were a country, it would be the most populous nation on earth. Now in its 11th year, Facebook stands today as one of the most popular social networking sites, comprising 1.59 billion accounts, approximately one fifth of the world's total population.
With tens of millions of users and more than a billion page views every day, Facebook ends up accumulating massive amounts of data.
One of the challenges the company has faced since its early days is developing a scalable way of storing and processing all these bytes, since using this historical data is a very big part of how it can improve the user experience on Facebook.
5. About a year back (2010), Facebook began experimenting with an open source project called Hadoop.
Hadoop provides a framework for large scale parallel processing using a distributed file system and the map-reduce programming paradigm.
Facebook started by importing some interesting data sets into a relatively small Hadoop cluster and was quickly rewarded, as developers latched on to the map-reduce programming model and started doing interesting projects that were previously impossible due to their massive computational requirements.
6. OS + Web server + Database + Programming language + Communication (servers / apps) = Data Infrastructure
7. What is Apache Hadoop?
Apache Hadoop is a collection of open source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.
It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Originally designed for computer clusters built from commodity hardware, it has also found use on clusters of higher-end hardware.
Goals of HDFS?
• Very Large Distributed File System:
– 10K nodes, 100 million files, 10 - 100 PB
• Assumes Commodity Hardware:
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
8. • Optimized for Batch Processing :
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth.
• User Space, runs on heterogeneous OS
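The slides contain no code, so here is a minimal, self-contained sketch of the map-reduce model they describe. The sample log lines and function names are invented for illustration; real Hadoop distributes the map, shuffle, and reduce steps across HDFS blocks and cluster nodes rather than running them in one process.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (word, 1) for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Sum all partial counts for one word."""
    return word, sum(counts)

def run_job(records):
    grouped = defaultdict(list)
    for record in records:                   # "map" step
        for key, value in map_phase(record):
            grouped[key].append(value)       # shuffle: group values by key
    return dict(reduce_phase(k, v) for k, v in grouped.items())  # "reduce" step

if __name__ == "__main__":
    logs = ["user viewed page", "user liked page", "page viewed"]
    print(run_job(logs))  # {'user': 2, 'viewed': 2, 'page': 3, 'liked': 1}
```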
9. What is Apache Hive ?
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing
data query and analysis.
Hive gives a SQL-like interface to query data stored in various databases and file systems that
integrate with Hadoop.
Comparison with traditional databases:
The storage and querying operations of Hive closely resemble those of traditional databases. However, while Hive uses a SQL dialect, there are many differences in the structure and workings of Hive compared with relational databases.
The differences are mainly because Hive is built on top of the Hadoop ecosystem, and has to comply with the restrictions of Hadoop and MapReduce.
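As an illustration only (not taken from the slides or the paper), the following sketch queries a Hive table from Python with the third-party PyHive client. It assumes a running HiveServer2; the host, table, and column names are hypothetical.

```python
from pyhive import hive  # third-party client for HiveServer2

# Hypothetical connection details and table; adjust to your cluster.
conn = hive.Connection(host="hive-gateway.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# Hive exposes a SQL-like language (HiveQL) that is compiled into
# distributed jobs over files stored in HDFS.
cursor.execute("""
    SELECT action, COUNT(*) AS cnt
    FROM page_view_log
    WHERE dt = '2020-01-01'
    GROUP BY action
    ORDER BY cnt DESC
    LIMIT 10
""")

for action, cnt in cursor.fetchall():
    print(action, cnt)

cursor.close()
conn.close()
```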
10. What is Apache HBase?
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System) or Alluxio, providing Bigtable-like capabilities for Hadoop.
That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
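For illustration, a minimal sketch of reading and writing sparse rows through the third-party happybase client (which talks to HBase's Thrift gateway). The gateway host, table, row keys, and column families are hypothetical, not Facebook's schema.

```python
import happybase  # Python client for HBase's Thrift gateway

# Hypothetical Thrift gateway and table.
connection = happybase.Connection("hbase-thrift.example.com", port=9090)
table = connection.table("messages")

# HBase stores sparse rows: only the cells that exist are written,
# grouped under column families (here "meta" and "body").
table.put(b"user42-msg-0001", {
    b"meta:sender": b"user42",
    b"meta:timestamp": b"1577836800",
    b"body:text": b"hello",
})

# Point read by row key, then a short scan over a row-key prefix.
row = table.row(b"user42-msg-0001")
print(row[b"body:text"])

for key, data in table.scan(row_prefix=b"user42-"):
    print(key, data.get(b"meta:timestamp"))

connection.close()
```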
11. What is Scribe?
Scribe (log server) was a server for aggregating log data streamed in real time from a large number of servers. It was designed to be scalable, extensible without client-side modification, and robust to failure of the network or any specific machine.
Scribe was developed at Facebook and released in 2008 as open source.
Scribe servers are arranged in a directed graph, with each server knowing only about the next server in the graph.
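A conceptual sketch of the store-and-forward behaviour described above, not Scribe's actual Thrift interface; every class and method name here is invented for illustration. Each node knows only its downstream neighbour and buffers locally when that neighbour is unreachable, which is what makes the chain robust to the failure of any single machine or link.

```python
import collections

class ScribeNodeSketch:
    """Toy model of one node in a Scribe-style forwarding graph."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream             # next node in the graph, or None (final sink)
        self.local_buffer = collections.deque()  # spill space while downstream is down
        self.stored = []                         # messages persisted at the sink
        self.available = True

    def log(self, category, message):
        entry = (category, message)
        if self.downstream is None:
            self.stored.append(entry)            # final aggregation point (e.g. HDFS writer)
        elif self.downstream.available:
            self._flush_buffer()
            self.downstream.log(category, message)
        else:
            self.local_buffer.append(entry)      # buffer until the neighbour recovers

    def _flush_buffer(self):
        while self.local_buffer:
            self.downstream.log(*self.local_buffer.popleft())

# Web server -> mid-tier aggregator -> central sink
sink = ScribeNodeSketch("central")
mid = ScribeNodeSketch("aggregator", downstream=sink)
web = ScribeNodeSketch("webserver", downstream=mid)

web.log("page_view", "user=42 url=/home")
mid.available = False                       # simulate a failed aggregator
web.log("page_view", "user=7 url=/profile") # buffered locally on the web server
mid.available = True
web.log("page_view", "user=9 url=/home")    # buffered entry is flushed first
print(len(sink.stored))                     # 3
```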
12. Apache ZooKeeper is a software project of the Apache Software Foundation. It is essentially a centralized service offering a hierarchical key-value store to distributed systems, used to provide a distributed configuration service, synchronization service, and naming registry for large distributed systems.
ZooKeeper was a sub-project of Hadoop but is now a top-level Apache project in its own right.
ZooKeeper was developed in order to fix the problems that occurred while deploying distributed big data applications. Some of the prime features of Apache ZooKeeper are:
Reliable System: The system is very reliable, as it keeps working even if a node fails.
Simple Architecture: The architecture of ZooKeeper is quite simple, as there is a shared hierarchical namespace which helps coordinate the processes.
Fast Processing: ZooKeeper is especially fast in "read-dominant" workloads (i.e. workloads in which reads are much more common than writes).
Scalable: The performance of ZooKeeper can be improved by adding nodes.
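For illustration, a minimal sketch of the hierarchical namespace and configuration/coordination use described above, using the third-party kazoo client for ZooKeeper. The ensemble address, znode paths, and payloads are hypothetical.

```python
from kazoo.client import KazooClient  # third-party Python client for ZooKeeper

# Hypothetical ensemble address and znode paths.
zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

# Shared hierarchical namespace: znodes look like filesystem paths and
# can hold small configuration payloads.
zk.ensure_path("/services/log-loader/workers")
if not zk.exists("/services/log-loader/config"):
    zk.create("/services/log-loader/config", b"batch_interval=300")

data, stat = zk.get("/services/log-loader/config")
print(data.decode(), "version:", stat.version)

# Ephemeral znodes disappear when the session ends, which is the usual
# building block for service discovery and leader election.
zk.create("/services/log-loader/workers/worker-", b"host-01",
          ephemeral=True, sequence=True)

# Watches notify the client when the children of a znode change.
@zk.ChildrenWatch("/services/log-loader/workers")
def on_workers_change(children):
    print("active workers:", children)

zk.stop()
```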
23. Semi-online Light Transaction Processing Databases (SLTP)
▪ Facebook Messages and Facebook Time Series
Immutable Data Store
▪ Photos, videos, etc.
Analytics Data Store
▪ Data Warehouse, Logs storage (this is what we will talk about!)
24. Size and Scale of Databases (total size and technology)
• Facebook Messages and Time Series Data: tens of petabytes
• Facebook Photos: high tens of petabytes (Haystack)
• Data Warehouse: hundreds of petabytes (this is what we will talk about!)
27. Fig 1 : Data Flow Architecture at Facebook
https://avishkarm.blogspot.com/2013/02/hadoop-architecture-and-its-usage-at.html
28. As shown in Figure 1, there are two sources of data:
1. The federated MySQL tier, which contains all the Facebook site related data.
2. The web tier, which generates all the log data.
And there are two different Hive-Hadoop clusters:
1. The production Hive-Hadoop cluster, which is used to execute jobs that need to adhere to very strict delivery deadlines.
2. The ad hoc Hive-Hadoop cluster, which is used to execute lower priority batch jobs as well as any ad hoc analysis that users want to do on historical data sets.
29. Data coming from the web servers
Is pushed to a set of Scribe-Hadoop (scribeh) clusters. These clusters comprise Scribe servers running on Hadoop clusters.
The Scribe servers aggregate the logs coming from different web servers and write them out as HDFS files in the associated Hadoop cluster.
More than 30 TB of data is transferred to the scribeh clusters every day.
In order to reduce cross data center traffic, the scribeh clusters are located in the data centers hosting the web tiers.
30. Data pushed to Scribe-Hadoop clusters
Is periodically compressed by copier jobs and transferred to the Hive-Hadoop clusters.
The copiers run at 5-15 minute intervals and copy out all the new files created in the scribeh clusters; in this manner the log data gets moved to the Hive-Hadoop clusters.
At this point the data is mostly in the form of HDFS files. It gets published, either hourly or daily, in the form of partitions in the corresponding Hive tables through a set of loader processes, and then becomes available for consumption.
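A conceptual sketch of such a copier job (Facebook's real copiers are not public, and the compression step is omitted here): list files on the scribeh cluster, and copy any file not seen before to the warehouse cluster. It uses the standard `hadoop fs` and `hadoop distcp` commands, but the cluster URIs and paths are hypothetical.

```python
import subprocess
import time

SCRIBEH = "hdfs://scribeh-cluster:8020/logs/page_view"
WAREHOUSE = "hdfs://hive-hadoop-cluster:8020/staging/page_view"
INTERVAL_SECONDS = 10 * 60          # copiers run every 5-15 minutes
already_copied = set()

def list_files(uri):
    """Return file paths under an HDFS directory via `hadoop fs -ls`."""
    out = subprocess.run(["hadoop", "fs", "-ls", uri],
                         capture_output=True, text=True, check=True).stdout
    # Regular files start with "-" in the permissions column;
    # the last whitespace-separated field of each line is the path.
    return {line.split()[-1] for line in out.splitlines() if line.startswith("-")}

def copy_new_files():
    for path in sorted(list_files(SCRIBEH) - already_copied):
        # distcp performs the cross-cluster copy as a distributed job.
        subprocess.run(["hadoop", "distcp", path, WAREHOUSE], check=True)
        already_copied.add(path)

if __name__ == "__main__":
    while True:
        copy_new_files()
        time.sleep(INTERVAL_SECONDS)
```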
31. Data coming from the federated MySQL tier
Is loaded into the Hive-Hadoop clusters through daily scrape processes.
Scrape processes:
• Dump the desired data sets from the MySQL databases.
• Compress them on the source systems.
• Move them into the Hive-Hadoop cluster.
The scrapes need to be resilient to failures and also need to be designed such that they do not put too much load on the MySQL databases.
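A conceptual sketch of one such scrape, following the three steps above with standard tools (mysqldump, gzip, `hadoop fs -put`). Facebook's actual scrape code is not public; the hosts, database, table, and paths are hypothetical.

```python
import subprocess

SHARD_HOST = "mysql-shard-01.example.com"
DATABASE = "users_shard_01"
TABLE = "profiles"
LOCAL_DUMP = f"/tmp/{DATABASE}.{TABLE}.sql.gz"
HDFS_TARGET = f"hdfs://hive-hadoop-cluster:8020/scrapes/{DATABASE}/{TABLE}/"

def scrape_table():
    # 1. Dump the desired data set. --single-transaction keeps the read
    #    consistent without locking tables for the whole dump, limiting
    #    the load placed on the production database.
    dump = subprocess.Popen(
        ["mysqldump", "--single-transaction", "-h", SHARD_HOST, DATABASE, TABLE],
        stdout=subprocess.PIPE)
    # 2. Compress on the source system before shipping.
    with open(LOCAL_DUMP, "wb") as out:
        subprocess.run(["gzip"], stdin=dump.stdout, stdout=out, check=True)
    dump.wait()
    if dump.returncode != 0:
        raise RuntimeError("mysqldump failed; scrape must be retried")
    # 3. Move the compressed dump into the Hive-Hadoop cluster.
    subprocess.run(["hadoop", "fs", "-put", "-f", LOCAL_DUMP, HDFS_TARGET], check=True)

if __name__ == "__main__":
    scrape_table()
```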
32. The production & the ad hoc Hive-Hadoop clusters
Why does Facebook use these two types of clusters?
The ad hoc nature of user queries makes it dangerous to run production jobs in the same cluster: a badly written ad hoc job can hog the resources in the cluster, thereby starving the production jobs.
In the absence of sophisticated sandboxing techniques, separating the clusters for ad hoc and production jobs has become the practical choice for the company in order to avoid such scenarios.
34. Facebook now has multiple Hadoop clusters deployed, with the biggest having about 2500 CPU cores and 1 petabyte of disk space.
It loads over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and has hundreds of jobs running each day against these data sets.
The list of projects that use this infrastructure has proliferated - from those generating mundane statistics about site usage, to others being used to fight spam and determine application quality.
Facebook uses the information generated by and from users to make decisions about improvements to the product. Hadoop has enabled the company to make better use of the data.
35. Because of the rapid adoption of Hadoop at Facebook:
Developers are free to write map-reduce programs in the language of their choice.
The company has embraced SQL as a familiar paradigm to address and operate on large data sets. Most data stored in Hadoop's file system is published as tables. Developers can explore the schemas and data of these tables much like they would with a traditional database; when they want to operate on these data sets, they can use a small subset of SQL to specify the required dataset.
Operations on datasets can be written as map and reduce scripts, using standard query operators (like joins and group-bys), or as a mix of the two.
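To make the "map and reduce scripts" option concrete, here is a hedged sketch of a mapper/reducer pair in the Hadoop Streaming style, counting log events per action. The input field layout is hypothetical; the same aggregation could equally be written as the SQL group-by shown earlier, or the two styles can be mixed.

```python
#!/usr/bin/env python3
"""Hadoop Streaming style map and reduce scripts (illustration only)."""
import sys
from itertools import groupby

def mapper(lines):
    # Input: tab-separated log lines "user_id<TAB>action<TAB>url".
    # Emit "action<TAB>1" so the framework can group records by action.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            print(f"{fields[1]}\t1")

def reducer(lines):
    # Streaming delivers mapper output sorted by key, so consecutive
    # lines with the same action can be summed with groupby.
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for action, group in groupby(parsed, key=lambda kv: kv[0]):
        print(f"{action}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    if sys.argv[1:] == ["reduce"]:
        reducer(sys.stdin)
    else:
        mapper(sys.stdin)
```

With Hadoop Streaming, the same file would typically be passed as both the mapper and the reducer (the latter with a "reduce" argument); the exact streaming jar path depends on the installation.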
36. A lot of different components (Hadoop (HDFS and MapReduce), Hive, Scribe, HBase, ZooKeeper, ...) come together to provide a comprehensive platform for processing data at Facebook.
This infrastructure is used for many different types of jobs, each having different requirements.
Facebook's data infrastructure is built on open source technologies:
Data Infrastructure Overview = Hadoop + Hive + HBase + Scribe
GraphQL: created by Facebook to enable communication between applications and servers. This language is now used by a large number of companies.
Facebook is a company that has contributed enormously to the rise of Big Data by releasing its innovations as open source.