From Raghu Ramakrishnan's presentation "Key Challenges in Cloud Computing and How Yahoo! is Approaching Them" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/510
DSD-INT 2017 The use of big data for dredging - De Boer - Deltares
Presentation by Gerben de Boer (van Oord) at the Symposium Earth Observation and Data Science, during Delft Software Days - Edition 2017. Thursday, 2 November 2017, Delft.
Recording: https://www.youtube.com/watch?v=qHkXVY2LpwU
External links: https://gist.github.com/itamarhaber/dddc3d4d9c19317b1477
Applications today are required to process massive amounts of data and return responses in real time. Simply storing Big Data is no longer enough; insights must be gleaned and decisions made as soon as data rushes in. In-memory databases like Redis provide the blazing fast speeds required for sub-second application response times. Using a combination of in-memory Redis and disk-based MongoDB can significantly reduce the “digestive” challenge associated with processing high velocity data.
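A minimal sketch of that combination, assuming a local Redis server and MongoDB instance plus the redis and pymongo client libraries (list, database, and collection names here are illustrative): hot events land in an in-memory Redis list, and a background step drains them to MongoDB in batches.

```python
# Sketch: absorb high-velocity events in Redis, persist them to MongoDB in batches.
# Assumes local Redis and MongoDB servers and the redis / pymongo packages.
import json

import redis
from pymongo import MongoClient

r = redis.Redis(host="localhost", port=6379)
mongo = MongoClient("mongodb://localhost:27017")
events = mongo["analytics"]["events"]

def ingest(event: dict) -> None:
    # Fast path: push the raw event onto a Redis list (in-memory, sub-millisecond).
    r.rpush("event_buffer", json.dumps(event))

def drain(batch_size: int = 1000) -> int:
    # Slow path: move a batch of buffered events into disk-based MongoDB.
    batch = []
    for _ in range(batch_size):
        raw = r.lpop("event_buffer")
        if raw is None:
            break
        batch.append(json.loads(raw))
    if batch:
        events.insert_many(batch)
    return len(batch)

if __name__ == "__main__":
    ingest({"user": "u1", "action": "click"})
    print("persisted", drain(), "events")
```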
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Big data and Hadoop are frameworks for processing and storing large datasets. Hadoop uses HDFS for distributed storage and MapReduce for distributed processing. HDFS stores large files across multiple machines for redundancy and parallel access. MapReduce divides jobs into map and reduce tasks that run in parallel across a cluster. Hadoop provides scalable and fault-tolerant solutions to problems like processing terabytes of data from jet engines or scaling to Google's data processing needs.
Why You Definitely Don’t Want to Build Your Own Time Series Database - InfluxData
At Outlyer, an infrastructure monitoring tool, we had to build our own TSDB back in 2015 to support our service. Two years later, we decided to take a different direction after seeing for ourselves how hard it is to build and scale a TSDB. This talk will review our journey, the challenges we hit trying to scale a TSDB for large customers and hopefully talk some people out of trying to build one themselves because it is not easy!
In this presentation I have explained the basic difference between the Hadoop architectures: Hadoop architecture 1 and Hadoop architecture 2. I used a website as a reference during the preparation.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It has two main components:
1) The Hadoop Distributed File System (HDFS) which stores data reliably across commodity hardware. It divides files into blocks and replicates them for fault tolerance.
2) MapReduce, which processes data in parallel. It handles scheduling, input partitioning, and failover. Users write mapping and reducing functions. Mappers process key-value pairs and output new pairs shuffled to reducers, which combine values for each key.
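As an illustrative sketch (not taken from the document), here is what such mapping and reducing functions look like for a word count with Hadoop Streaming; Hadoop splits the input, shuffles and sorts the mapper output by key, and streams it into the reducer.

```python
#!/usr/bin/env python3
# Sketch of a Hadoop Streaming word count: the mapper emits (word, 1) pairs,
# Hadoop shuffles/sorts them by key, and the reducer sums the counts per word.
# Run roughly as: hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#                 -mapper "wordcount.py map" -reducer "wordcount.py reduce"
import sys

def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```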
Ryan will expand on his popular blog series and drill down into the internals of the database. Ryan will discuss optimizing query performance, best indexing schemes, how to manage clustering (including meta and data nodes), the impact of IFQL on the database, the impact of cardinality on performance, TSI, and other internals that will help you architect better solutions around InfluxDB.
Redis & MongoDB: Stop Big Data Indigestion Before It Starts - Itamar Haber
Efficiently digesting data in large volumes can prove to be challenging for any database. The challenges are compounded when this influx must be analyzed on the fly, or "tasted", to satisfy the sophisticated palates of modern apps. Luckily, there are several proven remedies you can concoct with Redis to help with potential indigestion.
The URLs from the presentation are also available at: https://gist.github.com/itamarhaber/325e515c1715a12ef132
RubiX: a caching framework for big data engines in the cloud. It transparently provides data caching capabilities to engines such as Presto, Spark, and Hadoop without user intervention.
This document provides an overview of Apache Cassandra including its history, architecture, data modeling concepts, and how to install and use it with Python. Key points include that Cassandra is a distributed, scalable NoSQL database designed without single points of failure. It discusses Cassandra's architecture including nodes, datacenters, clusters, commit logs, memtables, and SSTables. Data modeling concepts explained are keyspaces, column families, and designing for even data distribution and minimizing reads. The document also provides examples of creating a keyspace, reading data using Python driver, and demoing data clustering.
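A minimal sketch of those steps with the DataStax cassandra-driver package against a local node; the keyspace, table, and column names are illustrative, not taken from the document.

```python
# Sketch: create a keyspace and table, insert, and read with the Python driver.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# A keyspace with a simple replication strategy (one replica; fine for a demo node).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")

# A column family keyed for even distribution (partition key) and clustered by timestamp.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,
        ts timestamp,
        value double,
        PRIMARY KEY (sensor_id, ts)
    )
""")

session.execute(
    "INSERT INTO readings (sensor_id, ts, value) VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.5),
)
for row in session.execute("SELECT * FROM readings WHERE sensor_id = %s", ("sensor-1",)):
    print(row.sensor_id, row.ts, row.value)
```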
This document provides an introduction and overview of Hadoop. It discusses the brief history of Hadoop, including its origins from Google papers in 2005 and promotion by Yahoo since 2006. It then discusses why Hadoop is useful for big data applications that are petabyte in scale, scalable, robust, and secure. Specific use cases like analytics, reporting, filtering and machine learning on log files, user behavior data, and other structured or unstructured data sources are covered. Finally, it outlines the Hadoop ecosystem and tools like native Java APIs, Pig, Hive, and streaming options for other languages.
Hive is a data warehouse infrastructure built on top of Hadoop that enables easy data summarization and analysis of large data volumes. It provides a simple query language called HiveQL that allows users to plug in custom mappers and reducers. Hive supports managed and external tables with different storage formats and uses a metastore to store metadata. While it allows SQL-like queries over data, Hive is not a relational database and does not support features like transactions or row-level updates.
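A hedged sketch of HiveQL over an external table, driven from Python via the PyHive client; the HiveServer2 address, table layout, and HDFS path are assumptions for illustration.

```python
# Sketch: query Hive from Python with PyHive. Assumes a HiveServer2 on localhost:10000
# and an existing directory of tab-separated log files in HDFS.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()

# External table: Hive only records the schema; the files stay where they are in HDFS.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ts STRING, user_id STRING, url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/web_logs'
""")

# HiveQL summarization; Hive compiles this into distributed jobs behind the scenes.
cur.execute(
    "SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url ORDER BY hits DESC LIMIT 10"
)
for url, hits in cur.fetchall():
    print(url, hits)
```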
Alluxio+Presto: An Architecture for Fast SQL in the Cloud - Alluxio, Inc.
Alluxio is a virtual distributed file system that serves as a data access layer between applications and storage systems. It provides a unified interface, improved performance through caching, and enables transparent migration between storage systems. Alluxio deployed with Presto on cloud storage like S3 can provide 5x faster query performance through caching query data in Alluxio workers located with compute. Case studies show how Alluxio improved response times for analytics workloads at large companies by eliminating remote data access and enabling data locality.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Presto on Alluxio Hands-On Lab
Speakers:
Bin Fan, Alluxio
Zac Blanco, Alluxio
Kamil Bajda-Pawlikowski, Starburst, Presto Company
Martin Traverso, Presto Software Foundation
For more Alluxio events: https://www.alluxio.io/events/
The document summarizes a workshop agenda for new InfluxData practitioners. It outlines the schedule of presentations and topics to be covered throughout the day-long workshop, including installing and querying the TICK stack, chronograf dashboarding, writing queries, architecting InfluxEnterprise, optimizing the TICK stack, and downsampling data. The final presentation on downsampling data is given by Michael DeSa and covers the concepts of downsampling, why it is useful, and how to perform it in InfluxDB using continuous queries and Kapacitor.
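As a rough sketch of the downsampling idea with continuous queries (the database and measurement names are illustrative; it uses the influxdb Python client against an InfluxDB 1.x instance):

```python
# Sketch: register an InfluxQL continuous query that downsamples raw CPU points
# to 5-minute means, then read the downsampled series back.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="telegraf")

cq = """
CREATE CONTINUOUS QUERY "cpu_5m_mean" ON "telegraf"
BEGIN
  SELECT mean("usage_user") AS "mean_usage_user"
  INTO "telegraf"."autogen"."cpu_5m"
  FROM "cpu"
  GROUP BY time(5m), *
END
"""
client.query(cq, method="POST")  # management statements go via POST

# The downsampled series can then be queried like any other measurement.
for point in client.query('SELECT * FROM "cpu_5m" LIMIT 5').get_points():
    print(point)
```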
Achieving Separation of Compute and Storage in a Cloud World - Alluxio, Inc.
Alluxio Tech Talk
Feb 12, 2019
Speaker:
Dipti Borkar, Alluxio
The rise of compute-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads – one in which compute scales independently from storage. While this enables elastic scaling, it introduces new problems: how do you co-locate data with compute, how do you unify data across multiple remote clouds, how do you keep storage and I/O service costs down, and many more.
Enter Alluxio, a virtual unified file system that sits between compute and storage and allows you to realize the benefits of a hybrid cloud architecture with the same performance and lower costs.
In this webinar, we will discuss:
- Why leading enterprises are adopting hybrid cloud architectures with compute and storage disaggregated
- The new challenges that this new paradigm introduces
- An introduction to Alluxio and the unified data solution it provides for hybrid environments
Accelerate Analytics and ML in the Hybrid Cloud Era - Alluxio, Inc.
Alluxio Community Office Hour
February 23, 2021
For more Alluxio events: https://www.alluxio.io/events/
Speaker(s):
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute. Many are now establishing new object data lakes as well. As a result, analytics engines such as Hive, Spark, and Presto, as well as machine learning workloads, are experiencing sluggish response times with data and compute in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Silicon Valley Cloud Computing Meetup
Mountain View, 2010-07-19
Examples of Hadoop Streaming, based on Python scripts running on the AWS Elastic MapReduce service, which show text mining on the "Enron Email Dataset" from Infochimps.com plus data visualization using R and Gephi
Source at: http://github.com/ceteri/ceteri-mapred
CityLABS Workshop: Working with large tables - Enrico Daga
This document discusses working with large tables and big data processing. It introduces distributed computing as an approach to process large datasets by distributing data across multiple nodes and parallelizing operations. The document then outlines using Apache Hadoop and the MK Data Hub cluster to distribute data storage and processing. It demonstrates how to use tools like Hue, Hive, and Pig to analyze tabular data in a distributed manner at scale. Finally, hands-on examples are provided for computing TF-IDF statistics on the large Gutenberg text corpus.
This document introduces distributed computing and tools for processing large tabular data using the Big Data Cluster. It discusses how distributed computing allows tabular data to be replicated across nodes and computation to be parallelized. It then provides an overview of Hadoop and how the Big Data Cluster can be used with tools like Hue, Hive, and Pig to perform analytics on large datasets. Finally, it walks through an example of computing TF-IDF scores on a corpus of text documents from Project Gutenberg.
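For orientation, here is a tiny in-memory version of the TF-IDF computation that the Hive/Pig examples run at scale; the toy documents are invented, and plain Python stands in for the distributed pipeline.

```python
# Sketch: TF-IDF for a handful of documents in plain Python, to show what the
# distributed Hive/Pig pipelines in the workshop compute at corpus scale.
import math
from collections import Counter

docs = {
    "moby": "call me ishmael call me",
    "alice": "alice was beginning to get very tired",
}

tokenized = {name: text.split() for name, text in docs.items()}
doc_freq = Counter()
for words in tokenized.values():
    doc_freq.update(set(words))  # document frequency: in how many docs a word occurs

n_docs = len(docs)
tfidf = {}
for name, words in tokenized.items():
    counts = Counter(words)
    tfidf[name] = {
        w: (c / len(words)) * math.log(n_docs / doc_freq[w])  # tf * idf
        for w, c in counts.items()
    }

print(tfidf["moby"])
```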
The document discusses challenges with enterprise machine learning and proposes Kubeflow as a solution. It notes that data scientists prefer different tools, models are difficult to deploy and manage at scale, and a shared platform is needed. Kubeflow is presented as providing a "lab and factory" environment to explore ideas and reproducibly run models at scale using containers, Kubernetes, notebooks, pipelines and other tools. It aims to help machine learning models progress from research to production.
Querix 4GL App Analyzer 2016: journey to the center of your 4GL application - BeGooden-IT Consulting
This document discusses documenting large legacy 4GL applications. It introduces the 4GL App Analyzer tool, which can automatically generate documentation like functional flow charts, comments in source code, and more from 4GL source code. The tool outputs documentation in HTML and other formats. Using tags in source code comments allows more detailed documentation of modules, functions, variables, forms and other elements. The documentation provides a centralized resource for understanding an application and planning its future.
The document provides an overview of Hadoop including what it is, how it works, its architecture and components. Key points include:
- Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers using simple programming models.
- It consists of HDFS for storage and MapReduce for processing via parallel computation using a map and reduce technique.
- HDFS stores data reliably across commodity hardware and MapReduce processes large amounts of data in parallel across nodes in a cluster.
The document discusses managing Drupal projects with Git. It describes using a "base build" repository as an upstream for new Drupal sites, containing common code. New sites clone the base build and treat it as an upstream remote, developing on their own origin remote. Periodically merging upstream keeps sites in sync while allowing independent development. This approach aims to minimize maintenance pain while focusing on user experiences rather than DevOps tasks.
Delft FEWS Adapter Python. Developing adapters for Delft FEWS to demonstrate models within FEWS. Bayesian Inference using Stan PyMC3 with existing URBS hydrologic models
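A minimal sketch of the Bayesian-inference side, using PyMC3 on synthetic observations; the model and data are illustrative and are not the URBS/Delft-FEWS adapter itself.

```python
# Illustrative PyMC3 model: infer the mean and spread of some observed values.
# This is only the inference pattern, not the actual URBS / Delft-FEWS adapter.
import numpy as np
import pymc3 as pm

observed = np.array([2.3, 2.9, 3.1, 2.7, 3.4, 2.8])  # synthetic observations

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)       # prior on the mean
    sigma = pm.HalfNormal("sigma", sigma=5.0)      # prior on the spread
    pm.Normal("obs", mu=mu, sigma=sigma, observed=observed)
    trace = pm.sample(1000, tune=1000, cores=1, return_inferencedata=False)

print(trace["mu"].mean(), trace["sigma"].mean())
```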
Presentation from the CopenhagenR - useR Group Meetup at IT University of Copenhagen on Oct. 11 2016 on how to automatically deploy web applications built in R to a Cloud server (here DigitalOcean) using open source Docker with GitHub and basic Continuous Integration (here CircleCI) for automated testing and deployment.
Presenter:
Niels Ole Dam, Things in Flow
Excerpt from the invitation to the meetup:
Niels will talk about his favorite R-setup and will demonstrate how R, combined with some nice DockeR and Github tricks, can help even small teams and companies leverage the power of modern cloud computing. Niels uses R on a daily basis in his work as an independent consultant and he will share his thoughts on DockeR at the next meetup.
Subjects covered:
- How to set up and use RStudio, Docker, and Docker Compose locally and with GitHub integration.
- How to set up and use Continuous Integration (CI) with automated testing and deployment to DigitalOcean using CircleCI, reusing the same docker-compose.yml file locally and remotely.
- Tips and tricks on how to set up a good workflow.
- Introduction to all the technologies and tools used.
There are lots of clickable links in the pdf-version of the slides.
Code for the setup demonstrated can be found at:
https://github.com/thingsinflow/r-docker-workflow
An accompanying clickable flowdiagram can be found at:
http://bit.ly/R-Docker-workflow
Enjoy!
:-)
This document provides an overview and introduction to Hadoop, HDFS, and MapReduce. It covers the basic concepts of HDFS, including how files are stored in blocks across data nodes, and the role of the name node and data nodes. It also explains the MapReduce programming model, including the mapper, reducer, and how jobs are split into parallel tasks. The document discusses using Hadoop from the command line and writing MapReduce jobs in Java. It also mentions some other projects in the Hadoop ecosystem like Pig, Hive, HBase and Zookeeper.
This document summarizes a presentation on using SQL Server Integration Services (SSIS) with HDInsight. It introduces Tillmann Eitelberg and Oliver Engels, who are experts on SSIS and HDInsight. The agenda covers traditional ETL processes, challenges of big data, useful Apache Hadoop components for ETL, clarifying statements about Hadoop and ETL, using Hadoop in the ETL process, how SSIS is more than just an ETL tool, tools for working with HDInsight, getting started with Azure HDInsight, and using SSIS to load and transform data on HDInsight clusters.
Kernel Recipes 2016 - Kernel documentation: what we have and where it’s going - Anne Nicolas
The Linux kernel features an extensive array of, to put it kindly, somewhat disorganized documentation. A significant effort is underway to make things better, though. This talk will review the state of kernel documentation, cover the changes that are being made (including the adoption of a new system for formatted documentation), and discuss how interested developers can help.
Jonathan Corbet, LWN.net
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/16WDJ8b.
Eli Collins overviews how to build new applications with Hadoop and how to integrate Hadoop with existing applications, providing an update on the state of the Hadoop ecosystem, frameworks, and APIs. Filmed at qconnewyork.com.
Eli Collins is the tech lead for Cloudera's Platform team, an active contributor to Apache Hadoop and member of its project management committee (PMC) at the Apache Software Foundation. Eli holds Bachelor's and Master's degrees in Computer Science from New York University and the University of Wisconsin-Madison, respectively.
Shaping the Future: To Globus Compute and Beyond! - Globus
This document introduces funcX, a new service for managing remote computation similar to how Globus manages data transfers. FuncX allows users to register Python functions and execute them on remote resources through a simple interface. It aims to make remote computing as easy as transferring files by handling authentication, resource configuration, and moving code and results between systems. The document demonstrates how funcX is being used for scientific applications like statistical model fitting and inverse spectroscopy that require running thousands of tasks on HPC resources. It solicits feedback to help guide the funcX roadmap and identifies common use cases around scaling jobs, automating workflows, and enabling community access through shared endpoints and functions.
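A sketch of the funcX usage pattern described here, following the tutorial flow of register, run, and fetch; the endpoint UUID is a placeholder, and an installed, authenticated funcx SDK (now Globus Compute) is assumed.

```python
# Sketch: register a Python function with funcX, run it on a remote endpoint,
# and fetch the result. The endpoint UUID below is a placeholder.
import time
from funcx.sdk.client import FuncXClient

def double(x):
    return 2 * x

fxc = FuncXClient()
function_id = fxc.register_function(double)

ENDPOINT_ID = "<your-endpoint-uuid>"  # placeholder: an endpoint you can submit to
task_id = fxc.run(21, endpoint_id=ENDPOINT_ID, function_id=function_id)

# Poll until the remote task finishes (get_result raises while the task is pending).
while True:
    try:
        print(fxc.get_result(task_id))  # -> 42
        break
    except Exception:
        time.sleep(2)
```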
Agile Data: Building Hadoop Analytics Applications - DataWorks Summit
This document provides an overview of steps to build an agile analytics application, beginning with raw event data and ending with a web application to explore and visualize that data. The steps include:
1) Serializing raw event data (emails, logs, etc.) into a document format like Avro or JSON
2) Loading the serialized data into Pig for exploration and transformation
3) Publishing the data to a "database" like MongoDB
4) Building a web interface with tools like Sinatra, Bootstrap, and JavaScript to display and link individual records
The overall approach emphasizes rapid iteration, with the goal of creating an application that allows continuous discovery of insights from the source data.
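A small sketch of steps 1 and 3 above, assuming a local MongoDB and the pymongo package; the event fields and collection names are illustrative, and Pig would sit between these steps at real scale.

```python
# Sketch of steps 1 and 3: serialize raw events to a line-oriented JSON document
# format, then publish the documents to MongoDB for the web tier to read.
import json
from pymongo import MongoClient

raw_events = [
    {"from": "alice@example.com", "to": "bob@example.com", "subject": "budget"},
    {"from": "carol@example.com", "to": "bob@example.com", "subject": "forecast"},
]

# 1) Serialize to newline-delimited JSON (Avro would be the richer alternative).
with open("events.json", "w") as fh:
    for event in raw_events:
        fh.write(json.dumps(event) + "\n")

# 3) Publish to a "database" the web application can query directly.
collection = MongoClient("mongodb://localhost:27017")["agile_data"]["emails"]
with open("events.json") as fh:
    collection.insert_many(json.loads(line) for line in fh)

print(collection.count_documents({"to": "bob@example.com"}))
```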
Agile Data Science: Building Hadoop Analytics Applications - Russell Jurney
This document discusses building agile analytics applications with Hadoop. It outlines several principles for developing data science teams and applications in an agile manner. Some key points include:
- Data science teams should be small, around 3-4 people with diverse skills who can work collaboratively.
- Insights should be discovered through an iterative process of exploring data in an interactive web application, rather than trying to predict outcomes upfront.
- The application should start as a tool for exploring data and discovering insights, which then becomes the palette for what is shipped.
- Data should be stored in a document format like Avro or JSON rather than a relational format, to reduce joins and better represent semi-structured data.
This document provides an introduction to big data and Hadoop. It discusses how the volume of data being generated is growing rapidly and exceeding the capabilities of traditional databases. Hadoop is presented as a solution for distributed storage and processing of large datasets across clusters of commodity hardware. Key aspects of Hadoop covered include MapReduce for parallel processing, the Hadoop Distributed File System (HDFS) for reliable storage, and how data is replicated across nodes for fault tolerance.
The document provides an introduction to the Hadoop ecosystem. It discusses the history of Hadoop, originating from Google's paper on MapReduce and Google File System. It describes some of the core components of Hadoop including HDFS for storage, MapReduce for distributed processing, and additional components like Hive, Pig, and HBase. It also discusses different Hadoop distributions from companies like Cloudera, Hortonworks, MapR, and others that package and support Hadoop deployments.
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
This document discusses text research and the Text-Fabric data model. It describes Text-Fabric as a data model for annotated text corpora, a query engine, a text weaver, and an API. The data model transforms TEI-XML into separate feature files to untangle annotations and enable better data logistics. Computational research involves gathering data from repositories, modeling and analyzing it, publishing results back to repositories, and discussing conclusions in notebooks. Publishing work flows include building websites to deliver research outputs to the general public more accessibly.
Towards TextPy, a module for processing text.
If we define annotated text as a graph with additional structure, we can make text processing more efficient, in the same way that Pandas makes processing dataframes more efficient.
We demonstrate how Text-Fabric can handle the display of text and annotations, even when chunks of text are not properly embedded in each other. This demo contains examples from the Hebrew Bible and the Old Babylonian Letters (cuneiform clay tablets).
This document discusses applying data analysis techniques used for ancient corpora to the Quran. It presents Text-Fabric (TF) as a graph database model for storing textual data in plain text files without XML or SQL. TF models text as nodes for words and phrases connected by edge relationships, and stores components like words, phrases, chapters and verses that can be uniquely identified. The document provides an example of a TF dataset containing parsed text from Iain M. Banks' novel "Consider Phlebas".
Researchers in ancient text corpora can take control over their data. We show a way to do so by means of Text-Fabric.
Co-production of Cody Kingham and Dirk Roorda
This document summarizes the history and current state of BHSA (Biblia Hebraica Stuttgartensia Amstelodamensis) tools. It describes early tools like Emdros and SHEBANQ, as well as more recent projects like Text-Fabric that encode texts in a graph structure with minimal encoding. Text-Fabric files separate each feature of the data into individual files for easy processing and combination. The document outlines Text-Fabric data, sharing, starting with the tool, publishing with it, and available apps and corpora. It promotes Text-Fabric's concepts of transparent, contributor-friendly encodings and provides links to relevant GitHub repositories and tutorials.
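For a feel of what that looks like in practice, here is a minimal Python sketch that loads the BHSA as Text-Fabric data and reads a few features; it assumes the text-fabric package is installed, and the sp feature name is specific to this dataset.

```python
# Sketch: load the BHSA Hebrew Bible as Text-Fabric data and poke at its nodes
# and features. `pip install text-fabric`; the first call downloads the corpus.
from collections import Counter

from tf.app import use

A = use("etcbc/bhsa", hoist=globals())   # exposes the N, F, E, L, T APIs

words = F.otype.s("word")                # all nodes of type "word"
print("words in the corpus:", len(words))

first_verse = F.otype.s("verse")[0]
print(T.sectionFromNode(first_verse))    # e.g. ('Genesis', 1, 1)
print(T.text(first_verse))               # the Hebrew text of that verse

# Features are just columns on nodes: count words by part of speech ("sp" feature).
print(Counter(F.sp.v(w) for w in words).most_common(5))
```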
Developing a tool for handling text with linguistic annotations. Text-Fabric is meant to support researchers who want to contribute portions of the data, and it weaves the contributions into a meaningful whole. Currently, it is primarily meant for working with the Hebrew Bible, based on the ETCBC (Amsterdam) linguistic database.
Conference presentation for 2016 annual meeting of the Society of Biblical Literature, San Antonio. (https://www.sbl-site.org).
Authors: Janet Dyk (linguistic ideas) and Dirk Roorda (computational implementation).
A verb organizes the elements in a sentence. Different patterns of constituents affect the meaning of a verb in a given context. The potential of a verb to combine with patterns of elements is known as its valence. A single set of questions, organized as a flow chart, selects the relevant building blocks within the context of a verb. The resulting pattern provides a particular significance for the verb in question. Because all contexts are submitted to the same flow chart, similarities and differences between verbs come to light. For example, verbs of movement in their causative formation manifest the same patterns as transitive verbs with an object that gets moved. We apply this approach to the whole Hebrew Bible, using the database of the Eep Talstra Centre for Bible and Computer (ETCBC), which contains the relevant linguistic annotations. This allows us to have a complete listing of all patterns for all verbs. It provides the basis for consistent proposals for the significance of specific patterns occurring with a particular verb. The valence results are made available in SHEBANQ, an online research tool based on the ETCBC database. It presents the basic data, text and linguistic features, together with annotations by researchers. The valence results consist of a set of algorithmically generated annotations which show up between the lines of the text. The algorithm itself and its documentation can be found at https://shebanq.ancient-data.org/tools?goto=valence. By using SHEBANQ we achieve several goals with respect to the scholarly workflow: (1) all our results are openly accessible online, and other researchers may comment on them; (2) all resources needed to reproduce this research are available online and can be downloaded (Open Access).
This document provides an overview of the SHEBANQ project, which provides tools for querying annotated Hebrew text data. It describes the data sources and contributors that have built up the underlying text corpus over many years. It also outlines the steps taken to make this data and related tools more accessible, including developing a website, depositing data in archives, running demonstration projects, and integrating the data and tools into broader research environments through additional projects and publications. The goal has been to facilitate wider use of this linguistic resource and foster more digital humanities and data science work based on its contents.
1. The document discusses layers of annotation for analyzing biblical Hebrew text, including the text itself, linguistic features, manually or automatically generated analyses, and queries for exegetical search.
2. It provides an overview of the Linguistic Annotation Framework (LAF) for representing annotated text and statistics on the annotation of one Hebrew text, with over 800,000 regions and 1.4 million nodes.
3. The document describes tools for querying the annotated text, including the SHEBANQ system and LAF-Fabric API, and the ability to work with the data in various formats like XML, binary files, and R.
20151111 Utrecht ver theol bibliothecarissen - Dirk Roorda
DANS is an institute of the Royal Netherlands Academy of Arts and Sciences and the Netherlands Organization for Scientific Research that promotes permanent access to digital research data. It provides data archiving services including depositing datasets in its online repository EASY, which ensures the data is findable, referable, downloadable, usable, and supports scholarly communication through publication of data papers. DANS also works with research organizations using a front office-back office model to facilitate long-term preservation of research data.
Text as Data: processing the Hebrew Bible - Dirk Roorda
The merits of stand-off markup (LAF) versus inline markup (TEI) for processing text as data. Ideas applied to work with the Hebrew Bible, resulting in tools for researchers and end-users.
Datamanagement for Research: A Case Study - Dirk Roorda
How practices of data sharing can help researchers to produce more science.
Session in the data management course organized by RDNL (Research Data in the Netherlands)
Hebrew Bible as Data: Laboratory, Sharing, Lessons - Dirk Roorda
The document discusses using the Hebrew Bible as a data source for research. It describes several databases and tools for querying and analyzing the data, including ETCBC, SHEBANQ, and LAF-Fabric. It provides an overview of how the data is created, archived, shared and disseminated through the research data cycle. Examples are given of using LAF-Fabric to count nodes, write plain text, and visualize annotations. The goal is to make the Hebrew Bible and linguistic annotations available as linked open data for various types of researchers.
LAF-Fabric: a tool to process the ETCBC Hebrew Text Database in Linguistic Annotation Framework.
How researchers in theology and linguistics can create workflows to analyse the text of the Hebrew Bible and extract data for visualization. Those workflows can be written in Python, and run conveniently in the IPython Notebook.
Joint work with Martijn Naaijer (VU University).
With the Hebrew Bible encoded in Linguistic Annotation Framework (LAF-ISO), and with a new LAF processing tool, we demonstrate how you can do practical data analysis. The tool, LAF-Fabric, integrates with the ipython notebook approach. Our example here is lexeme cooccurrence analysis of bible books. For now, the road from data to visualization is more important than the exact visualization.
The document describes the Linguistic Annotation Framework (LAF), which is an ISO standard for representing stand-off annotation of language resources. LAF allows for annotating text with linguistic information like part-of-speech tags or named entities in an XML format. Example annotated text corpora using LAF include the Open American National Corpus and a text database of the Hebrew Bible. The document then discusses challenges with existing LAF processors and introduces LAF-Fabric as a new tool that compiles LAF annotations into binary data for faster querying of linguistic features and running Python scripts against the data.
2. Generale Missieven
• yearly letters from the governor and board of the Dutch East India Company to the Dutch government (Heren XVII)
• 1610-1761
• 13 volumes
• 565 letters
• 10,000 pages
resources.huygens.knaw.nl/vocgeneralemissiven
4. Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
It is a toolkit / model / framework / ethos to
1. get corpus data into RAM
2. compute with it efficiently
3. harvest results
4. recycle results back to the corpus
and to do this in a way that
1. is reproducible
2. reduces friction
7. Source: TEI
• page number
• it's OK for automatic processing, very discouraging for manual checking and double checking
• very long lines
• inhuman file names
8. Laundry - trim0
• some pages are hopeless
• we re-sourced data from the OCR strings of the Huygens website
• cases:
• letters without original content not in TEI (but there is editorial content and metadata)
• pages with big tables (landscape) resulted in pathological TEI
9. Humane data!
• file names are page numbers
• metadata is flattened
• much of the XML overhead is gone
• line breaks are reflected in the layout
All the inherent problems in this dataset are still there. But now we have hope to see them, to tackle them.
10. Laundry - trim1
text separation:
• mark folio references
• correct the markup of page headers
without this step:
• loss of original text
• contamination of original text
(example: vol. 2 p 538, before and after)
11. Laundry - trim2
• metadata
• re-distil from letter headings
• check
• diagnostics
(before and after)
12. Laundry - trim3 - the mother of all laundries
• get the editorial remarks under tight control even when they spread across pages
• detect all 12,000+ footnote bodies correctly (done)
• connect all footnote refs to their bodies (done)
None of this is feasible without successful completion of the previous steps.
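Purely as an illustration of that last linking step (the real trim scripts live in the Dans-labs/clariah-gm repository and work on the actual TEI markup), here is a toy sketch of matching note references to note bodies; the page text and note block are invented.

```python
# Toy sketch of footnote linking: find note references in running text and
# attach them to note bodies gathered elsewhere on the page.
import re

page_text = "De gouverneur berigt 1) dat de vloot vertrokken is 2)."
note_block = """1) Zie missive van 12 maart.
2) Dit betreft Batavia."""

# Gather note bodies: a number followed by ')' at the start of a line.
bodies = dict(re.findall(r"^(\d+)\)\s*(.+)$", note_block, flags=re.MULTILINE))

# Connect each in-text reference to its body; report references without a body.
for ref in re.findall(r"(\d+)\)", page_text):
    body = bodies.get(ref)
    print(f"note {ref}: {body if body else 'MISSING BODY'}")
```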
18. Centrifuge
• Result: clean, dry stuff: Text-Fabric
github.com/Dans-labs/clariah-gm/tf/
With clean XML in hand, we centrifuge the XML out of the clean laundry:
• we squeeze out all tag material (moisture)
• leaving only pure content (dry clothes)
• ready to process (ready to wear)
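A toy sketch of the centrifuge idea with the standard-library ElementTree: squeeze out the tags and keep the textual content. The real pipeline emits Text-Fabric feature files rather than plain text, and the XML snippet here is invented.

```python
# Sketch: strip the markup "moisture" out of clean XML, keeping only the content.
import xml.etree.ElementTree as ET

xml_page = """<page n="538">
  <p>Aan de Heren XVII.</p>
  <p>Wij hebben <hi rend="italic">uw missive</hi> ontvangen.</p>
</page>"""

root = ET.fromstring(xml_page)
for paragraph in root.findall("p"):
    # itertext() yields all text fragments with the tags removed.
    print("".join(paragraph.itertext()))
```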
23. • start
• move around programmatically
• search
• get in focus
• compute
• refine by computing
• export to Excel
• collect work sheets
• annotate
• insights are the new data
• share
• let others collect your data as easily as you collected this corpus
annotation/tutorials/missieven
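A short sketch of the start / search / export loop above, shown on the BHSA corpus for concreteness; the missives corpus is driven the same way once its own Text-Fabric app is loaded.

```python
# Sketch: start, search, and export with Text-Fabric (here on the BHSA corpus).
from tf.app import use

A = use("etcbc/bhsa", hoist=globals())

# Declarative search template: verbs inside clauses inside Genesis.
results = A.search("""
book book=Genesis
  clause
    word sp=verb
""")

A.table(results, end=5)   # quick look at the first hits
A.export(results)         # writes results.tsv, ready to collect as an Excel work sheet
```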
30. what does this road mean?
• for researchers?
• for CLARIAH?
• for DANS / eScience Center / Humanities Cluster / HuygensING
31. researchers
• short road to be completely "hands on" with their own corpora
• compute in their first programming language: "XML"
• no technological overhead outside their computing scope: XML, RDF, PID
• no metadata intricacy
• focus on data according to their own mental concepts: the data features
TF corpora
32. CLARIAH
• a unified practice to compute with corpora:
• students of different corpora can share practices
• they can build cookbooks that transcend their particular corpus
• remember "peculiarity of missives"?
• nearly the same recipe exists for a dozen corpora
• where is the greater gain:
• sorting out metadata?
• supporting the processing of metadata?
TF corpora
33. DANS / eScience / HuC / archives
Text-Fabric uses GitHub as data-backend!
• GitHub is unique in supporting versioned data check-in / check-out
• GitHub is a hub toward top-notch preservation services: Zenodo, Software Heritage Foundation
YET:
• GH is optimized for code, not (big) data
• although you can create private repos, GH has little support for access roles there
AND
• GH's diffing techniques may be over the top for data
34. DANS / eScience / HuC / archives
We need another data backend:
• based on the practices of a FAIR repository
• where researchers have the same kind of control as they have in GitHub
• that supports versioning
• where you can download specific versions of specific subfolders of specific datasets under program control: API
35. DANS / eScience / HuC / archives
• We need a TextHub, a Data Station for processable, annotated Text
• One corpus has many authors that deliver many parts of the data
• Authors control their own parts and share them from places they "own" on the Hub
• Users grab those parts from the Hub under program control
• And deliver the new parts they create to the Hub
36. DANS / eScience / HuC / archives
DANS: provide the Hub (Data Station in Dataverse)
eScience: support best computing practices around the Hub
HuC: consider the Hub as a hop-on to larger infrastructure
Archives: invest in resources on the shelf: make them Hub ready
37. Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
corpus data into memory
compute
harvest
share & recycle
be reproducible
go smoothly
dirk.roorda@dans.knaw.nl