The document is a presentation about Hadoop and how it can be used to manage large amounts of data. It provides an overview of Hadoop, including what it is, how it works, and some of its main components like HDFS and MapReduce. It also discusses examples of how Hadoop has been implemented at large companies like Facebook and Yahoo to handle petabytes of data and power applications for tasks like analytics, search, and optimization. The presentation aims to explain the benefits of Hadoop for solving "big data" problems.
Elasticsearch on Azure: Make sense of your (BIG) data! (Microsoft)
Licensed under Apache 2, elasticsearch is a powerful, distributed, and scalable search engine. It also provides real-time aggregations tailored to your needs. Coupled with Kibana, a generic and highly customizable dashboard, it lets you make sense of your data immediately. With adoption growing rapidly among companies and public-facing sites, discover what elasticsearch and Kibana are and how easy it is to deploy them on the Windows Azure platform. Thomas and David will use customer cases to illustrate the benefits obtained with these solutions.
Speakers: Thomas Conté (Microsoft), David Pilato (Elasticsearch)
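As a minimal illustration of the kind of real-time aggregation described above (a sketch only; the index name, the status field, and the local endpoint are assumptions, not part of the talk), a terms aggregation can be issued against elasticsearch over plain HTTP:

```python
import requests

# Hypothetical endpoint and index; counts documents per value of a "status" field.
resp = requests.post(
    "http://localhost:9200/logs-*/_search",
    json={
        "size": 0,  # return aggregation buckets only, no individual hits
        "aggs": {"by_status": {"terms": {"field": "status.keyword"}}},  # assumes a keyword field
    },
    timeout=10,
)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_status"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```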
Spark is a powerful framework for distributed processing of massive datasets. With an interactive shell, machine learning libraries, and in-memory data structures, Spark provides a tool set for high performance advanced analytics. Connecting Spark with MongoDB enables you to achieve sophisticated back-end analytics in combination with the performance of MongoDB. We'll take a look at how these two systems integrate with one another through sample code and demonstrations.
Presentation from Bryan Reneiro, Developer Advocate at MongoDB.
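A rough sketch of the kind of integration described above, assuming the MongoDB Spark connector (v3.x, which registers the "mongo" data source) is on the classpath; the URIs, collection names, and the status field are placeholders, not the presenter's code:

```python
from pyspark.sql import SparkSession

# Submit with e.g.:
#   spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 job.py
spark = (
    SparkSession.builder.appName("mongo-spark-sketch")
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/demo.events")
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/demo.summary")
    .getOrCreate()
)

events = spark.read.format("mongo").load()            # MongoDB collection -> Spark DataFrame
summary = events.groupBy("status").count()            # simple back-end aggregation in Spark
summary.write.format("mongo").mode("append").save()   # write the results back to MongoDB
```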
Using Apache ACE as a distribution and management platform for a large (and growing) number of embedded devices in the field.
I used this presentation at Apachecon NA 2010.
I'm more about story and images than about text on slides, but you can try to follow along here.
OSGi technology is becoming the preferred approach for creating highly modular and dynamically extensible applications. With open source framework implementations like Eclipse Equinox and Apache Felix readily available, there is no better time to move to OSGi technology. However, doing so requires mastering the assembly, provisioning, and discovery of the components that make up your system. Apache ACE, an Apache Incubator project, is a software distribution framework that allows you to centrally manage and distribute software components, configuration data, and other artifacts to target systems. We will focus on building and managing OSGi deployments, showing you how to use Apache ACE to bootstrap a framework and deploy to remotely managed systems. We will also show how ACE can be used to deploy bundles to an Android-based phone.
A CouchDB presentation with some technical details, made for a technical audience; it shows use cases, a comparison with other NoSQL databases, and why CouchDB is useful for publishers.
OCF.tw's talk "Introduction to Spark" (Giivee The)
A talk on Spark given at the invitation of OCF and OSSF.
If you are interested in the Open Culture Foundation (財團法人開放文化基金會, OCF) or the Open Source Software Foundry (自由軟體鑄造場, OSSF),
please check http://ocf.tw/ or http://www.openfoundry.org/
Thanks also to CLBC for providing the venue.
If you would like to work in a great working environment,
you are welcome to contact CLBC: http://clbc.tw/
Overview of myHadoop 0.30, a framework for deploying Hadoop on existing high-performance computing infrastructure. Discussion of how to install it, spin up a Hadoop cluster, and use the new features.
myHadoop 0.30's project page is now on GitHub (https://github.com/glennklockwood/myhadoop) and the latest release tarball can be downloaded from my website (glennklockwood.com/files/myhadoop-0.30.tar.gz)
An introduction to cloud programming models and the Skywriting project. Talk originally given at the University of Cambridge, on 11th June 2010.
More information about the Skywriting project can be found here: http://www.cl.cam.ac.uk/netos/skywriting/
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
JMeter webinar - integration with InfluxDB and Grafana (RTTS)
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring of JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
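For context on what the Backend Listener actually stores, here is a hedged sketch of pulling JMeter metrics back out of InfluxDB with the influxdb Python client; the database name, the jmeter measurement, and the avg/transaction field and tag names are the usual defaults of the bundled InfluxDB listener, but verify them against your own configuration (Grafana panels issue equivalent queries under the hood):

```python
from influxdb import InfluxDBClient  # InfluxDB 1.x client library

# Assumed connection details and schema written by JMeter's InfluxDB Backend Listener.
client = InfluxDBClient(host="localhost", port=8086, database="jmeter")

query = (
    'SELECT MEAN("avg") FROM "jmeter" '
    'WHERE time > now() - 1h '
    'GROUP BY time(1m), "transaction"'
)
result = client.query(query)

# One series per transaction label; print the per-minute mean response times.
for (measurement, tags), points in result.items():
    print(tags["transaction"], [p["mean"] for p in points])
```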
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (the CI/CD process) involves many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Search and Society: Reimagining Information Access for Radical Futures (Bhaskar Mitra)
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning over just any symbolic structure is not sufficient to really harvest the gains of NeSy. Those gains will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as "predictable inference".
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
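As a taste of the Python binding mentioned above, here is a minimal, hedged sketch using the pypowsybl package (PowSyBl's Python binding); the IEEE 14-bus example network and the accessors shown are assumptions based on the public pypowsybl API, not the workshop notebook itself:

```python
import pypowsybl.network as pn
import pypowsybl.loadflow as lf

network = pn.create_ieee14()      # load the bundled IEEE 14-bus example network
results = lf.run_ac(network)      # run an AC power flow on it
print(results[0].status)          # convergence status of the main synchronous component

# Bus results come back as a pandas DataFrame; inspect voltage magnitude and angle.
print(network.get_buses()[["v_mag", "v_angle"]].head())
```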
2. Hadoop and Cloudera
Managing Petabytes with Open Source
Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
October 27, 2009
3. Why You Should Care
Hadoop in the Life Sciences
▪ CloudBurst: Highly Sensitive Short Read Mapping with MapReduce
▪ “CloudBurst reduces the running time from hours to mere minutes”
▪ Crossbow: Genotyping from short reads using cloud computing
▪ "Crossbow shows how Hadoop can be an enabling technology for computational biology"
▪ SMARTS substructure searching using the CDK and Hadoop
▪ "The Hadoop framework makes handling large data problems pretty much trivial"
▪ Smith-Waterman Protein Alignment
▪ “Existing algorithms ported easily to Hadoop”
4. My Background
Thanks for Asking
▪ hammer@cloudera.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers
▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist (other titles)
▪ Also, check out the book “Beautiful Data”
5. Presentation Outline
▪ What is Hadoop?
▪ HDFS
▪ MapReduce
▪ Hive, Pig, Avro, Zookeeper, and friends
▪ Solving big data problems with Hadoop at Facebook and Yahoo!
▪ Short history of Facebook’s Data team
▪ Hadoop applications at Yahoo!, Facebook, and Cloudera
▪ Other examples: LHC, smart grid, genomes
▪ Questions and Discussion
6. What is Hadoop?
▪ Apache Software Foundation project, mostly written in Java
▪ Inspired by Google infrastructure
▪ Software for programming warehouse-scale computers (WSCs)
▪ Hundreds of production deployments
▪ Project structure
▪ Hadoop Distributed File System (HDFS)
▪ Hadoop MapReduce
▪ Hadoop Common
▪ Other subprojects
▪ Avro, HBase, Hive, Pig, Zookeeper
7. Anatomy of a Hadoop Cluster
▪ Commodity servers
▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC
▪ Inexpensive to acquire and maintain
▪ Typically arranged in a 2-level architecture
▪ Nodes are commodity Linux PCs
▪ 40 nodes per rack
[Diagram: commodity hardware cluster]
8. HDFS
▪ Pool commodity servers into a single hierarchical namespace
▪ Break files into 128 MB blocks and replicate blocks
▪ Designed for large files written once but read many times
▪ Files are append-only
▪ Two major daemons: NameNode and DataNode
▪ NameNode manages file system metadata
▪ DataNode manages data using local filesystem
▪ HDFS manages checksumming, replication, and compression
▪ Throughput scales nearly linearly with cluster size
▪ Access from Java, C, command line, FUSE, or Thrift
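The slide lists the command line among HDFS's access paths; as a minimal sketch (the paths are hypothetical, and it assumes the hadoop client is installed and configured for the cluster), those same shell commands can be scripted from Python:

```python
import subprocess

def hdfs(*args):
    """Run an 'hadoop fs' command and return its stdout."""
    return subprocess.run(
        ["hadoop", "fs", *args], check=True, capture_output=True, text=True
    ).stdout

hdfs("-mkdir", "-p", "/user/demo/logs")           # create a directory (hypothetical path)
hdfs("-put", "access.log", "/user/demo/logs/")    # copy a local file into HDFS
print(hdfs("-ls", "/user/demo/logs"))             # list the directory contents
```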
10. Hadoop MapReduce
▪ Fault tolerant execution layer and API for parallel data processing
▪ Can target multiple storage systems
▪ Key/value data model
▪ Two major daemons: JobTracker and TaskTracker
▪ Many client interfaces
▪ Java
▪ C++
▪ Streaming
▪ Pig
▪ SQL (Hive)
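Slide 10 lists Streaming among the client interfaces; as an illustrative sketch (the file names, HDFS paths, and streaming jar location are placeholders), a word count can be written as a single Python script that serves as both mapper and reducer:

```python
#!/usr/bin/env python
# Hadoop Streaming word count sketch. Submit with something like:
#   hadoop jar hadoop-streaming.jar \
#     -input /user/demo/text -output /user/demo/wordcount \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
import sys

def map_phase():
    # Emit one "<word>\t1" record per token; the framework sorts by key before reducing.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reduce_phase():
    # Keys arrive sorted, so counts for each word can be summed in a single pass.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    (map_phase if sys.argv[1] == "map" else reduce_phase)()
```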
11. MapReduce
MapReduce pushes work out to the data
Hadoop takes advantage of HDFS's data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.
[Figure: Hadoop pushes work out to the data]
12. Hadoop Subprojects
▪ Avro
▪ Cross-language framework for RPC and serialization
▪ HBase
▪ Table storage on top of HDFS, modeled after Google’s BigTable
▪ Hive
▪ SQL interface to structured data stored in HDFS
▪ Pig
▪ Language for data flow programming; also Owl, Zebra, SQL
▪ Zookeeper
▪ Coordination service for distributed systems
13. Hadoop Community Support
▪ 185 total contributors to the open source code base
▪ 60 engineers at Yahoo!, 15 at Facebook, 15 at Cloudera
▪ Over 500 (paid!) attendees at Hadoop World NYC
▪ Hadoop World Beijing later this month
▪ Three books (O’Reilly, Apress, Manning)
▪ Training videos free online
▪ Regular user group meetups in many cities
▪ University courses across the world
▪ Growing consultant and systems integrator expertise
▪ Commercial training, certification, and support from Cloudera
14. Hadoop Project Mechanics
▪ Trademark owned by ASF; Apache 2.0 license for code
▪ Rigorous unit, smoke, performance, and system tests
▪ Release cycle of 3 months (-ish)
▪ Last major release: 0.20.0 on April 22, 2009
▪ 0.21.0 will be last release before 1.0; nearly complete
▪ Subprojects on different release cycles
▪ Releases put to a vote according to Apache guidelines
▪ Releases made available as tarballs on Apache and mirrors
▪ Cloudera packages its own release for many platforms
▪ RPM and Debian packages; AMI for Amazon’s EC2
15. Hadoop at Facebook
Early 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging
16. Facebook Data Infrastructure
2007
[Architecture diagram: Scribe tier and MySQL tier feeding a data collection server, which loads an Oracle database server]
17. Facebook Data Infrastructure
2008
[Architecture diagram: Scribe tier and MySQL tier feeding a Hadoop tier, which feeds Oracle RAC servers]
18. Major Data Team Workloads
▪ Data collection
▪ server logs
▪ application databases
▪ web crawls
▪ Thousands of multi-stage processing pipelines
▪ Summaries consumed by external users
▪ Summaries for internal reporting
▪ Ad optimization pipeline
▪ Experimentation platform pipeline
▪ Ad hoc analyses
19. Workload Statistics
Facebook 2009
▪ Largest cluster running Hive: 4,800 cores, 5.5 PB of storage
▪ 4 TB of compressed new data added per day
▪ 135 TB of compressed data scanned per day
▪ 7,500+ Hive jobs per day
▪ 80K compute hours per day
▪ Around 200 people per month run Hive jobs
20. Hadoop at Yahoo!
▪ Jan 2006: Hired Doug Cutting
▪ Apr 2006: Sorted 1.9 TB on 188 nodes in 47 hours
▪ Apr 2008: Sorted 1 TB on 910 nodes in 209 seconds
▪ Aug 2008: Deployed 4,000 node Hadoop cluster
▪ May 2009: Sorted 1 TB on 1,460 nodes in 62 seconds
▪ Data Points
▪ Over 25,000 nodes running Hadoop across 17 clusters
▪ Hundreds of thousands of jobs per day
▪ Typical HDFS cluster: 1,400 nodes, 2 PB capacity
▪ Sorted 1 PB on 3,658 nodes in 16.25 hours
21. Example Hadoop Applications
▪ Yahoo!
▪ Yahoo! Search Webmap
▪ Content and ad targeting optimization
▪ Facebook
▪ Fraud and abuse detection
▪ Lexicon (text mining)
▪ Cloudera
▪ Facial recognition for automatic tagging
▪ Genome sequence analysis
▪ Financial services, government, and of course: HEP!
22. Cloudera Offerings
Only One Slide, I Promise
▪ Two software products
▪ Cloudera’s Distribution for Hadoop
▪ Cloudera Desktop
▪ Training and Certification
▪ For Developers, Operators, and Managers
▪ Support
▪ Professional services
23. Cloudera Desktop
Big Data can be Beautiful
24. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0