A PhD student at the Graduate Institute of Electrical Engineering, National Taiwan University, devoted to promoting the R language. He has organized many R outreach talks and frequently shares his R experience at the Taiwan R User Group. He has extensive hands-on R experience covering data collection, cleaning, analysis, and report production, and specializes in building R data-analysis systems tailored to project needs and in writing high-performance algorithms with R and C++.
These are the slides from my presentation on running R in the database using Oracle R Enterprise. The second half of the presentation was a live demo of Oracle R Enterprise; unfortunately, the demo is not captured in these slides.
This tutorial is designed for anyone who needs to work with data stored in HDF5 files. It will cover the functionality and useful features of the HDF5 utilities, which include h5dump, h5diff, h5repack, h5stat, h5copy, h5check and h5repart. The tutorial will also introduce recent changes and new features of the utilities.
HDFView is a visual tool for browsing and editing HDF4 and HDF5 files. Basic features and recent changes in HDFView will be presented. Details of recent development in the HDF-Java products will be discussed in a separate presentation.
EuroPython 2015 - Big Data with Python and Hadoop, by Max Tepkeev
Big Data - these two words are heard so often nowadays. But what exactly is Big Data? Can we, Pythonistas, enter the wonder world of Big Data? The answer is definitely “Yes”.
This talk is an introduction to big data processing with Apache Hadoop and Python. We’ll talk about Apache Hadoop, its concepts and infrastructure, and how one can use Python with it. We’ll compare the speed of Python jobs under different Python implementations, including CPython, PyPy and Jython, and also discuss which Python libraries are available for working with Apache Hadoop.
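To make the Hadoop-plus-Python combination concrete, here is a minimal word-count sketch in the classic Hadoop Streaming style the talk builds on; the mapper/reducer pattern is standard, but the file names and cluster paths are illustrative assumptions, not taken from the talk.

```python
#!/usr/bin/env python
# mapper.py: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py: Hadoop Streaming sorts mapper output by key before the
# reduce phase, so identical words arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Because both scripts only read stdin and write stdout, they run unchanged under CPython, PyPy, or Jython, which is what makes the interpreter speed comparison possible. A typical (illustrative) invocation: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out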
... or how to query an RDF graph with 28 billion triples on a standard laptop
These slides correspond to my talk at the Stanford Center for Biomedical Informatics, on 25th April 2018
A talk presented at an NSF Workshop on Data-Intensive Computing, July 30, 2009.
Extreme scripting and other adventures in data-intensive computing
Data analysis in many scientific laboratories is performed via a mix of standalone analysis programs, often written in languages such as Matlab or R, and shell scripts, used to coordinate multiple invocations of these programs. These programs and scripts all run against a shared file system that is used to store both experimental data and computational results.
While superficially messy, the flexibility and simplicity of this approach makes it highly popular and surprisingly effective. However, continued exponential growth in data volumes is leading to a crisis of sorts in many laboratories. Workstations and file servers, even local clusters and storage arrays, are no longer adequate. Users also struggle with the logistical challenges of managing growing numbers of files and computational tasks. In other words, they face the need to engage in data-intensive computing.
We describe the Swift project, an approach to this problem that seeks not to replace the scripting approach but to scale it, from the desktop to larger clusters and ultimately to supercomputers. Motivated by applications in the physical, biological, and social sciences, we have developed methods that allow for the specification of parallel scripts that operate on large amounts of data, and the efficient and reliable execution of those scripts on different computing systems. A particular focus of this work is on methods for implementing, in an efficient and scalable manner, the POSIX file system semantics that underpin scripting applications. These methods have allowed us to run applications unchanged on workstations, clusters, infrastructure as a service ("cloud") systems, and supercomputers, and to scale applications from a single workstation to a 160,000-core supercomputer.
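The scripting pattern being scaled here can be sketched in a few lines of plain Python; this is only an analogue of a Swift script (Swift declares such file-to-file transformations declaratively and parallelizes them automatically), and the analysis command and directory names are hypothetical.

```python
# Hand-rolled analogue (not Swift itself) of the pattern Swift scales up:
# run an external analysis program over many input files in parallel,
# writing results back to a shared file system.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def analyze(sample: Path) -> Path:
    out = sample.with_suffix(".result")
    # "analyze_sample" stands in for a Matlab/R/compiled analysis program.
    subprocess.run(["analyze_sample", str(sample), str(out)], check=True)
    return out

if __name__ == "__main__":
    samples = sorted(Path("experiment_data").glob("*.dat"))
    with ProcessPoolExecutor() as pool:  # workstation-level parallelism
        for result in pool.map(analyze, samples):
            print("wrote", result)
```

Swift's contribution is letting the same logical script run unchanged when the pool of workers is a cluster, a cloud, or a supercomputer, with the POSIX file-passing semantics handled efficiently underneath.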
Swift is one of a variety of projects in the Computation Institute that seek individually and collectively to develop and apply software architectures and methods for data-intensive computing. Our investigations seek to treat data management and analysis as an end-to-end problem. Because interesting data often has its origins in multiple organizations, a full treatment must encompass not only data analysis but also issues of data discovery, access, and integration. Depending on context, data-intensive applications may have to compute on data at its source, move data to computing, operate on streaming data, or adopt some hybrid of these and other approaches.
Thus, our projects span a wide range, from software technologies (e.g., Swift, the Nimbus infrastructure as a service system, the GridFTP and DataKoa data movement and management systems, the Globus tools for service oriented science, the PVFS parallel file system) to application-oriented projects (e.g., text analysis in the biological sciences, metagenomic analysis, image analysis in neuroscience, information integration for health care applications, management of experimental data from X-ray sources, diffusion tensor imaging for computer aided diagnosis), and the creation and operation of national-scale infrastructures, including the Earth System Grid (ESG), cancer Biomedical Informatics Grid (caBIG), Biomedical Informatics Research Network (BIRN), TeraGrid, and Open Science Grid (OSG).
For more information, please see www.ci.uchicago/swift.
Federated SPARQL Query Processing with Replicated Fragments, by Pascal Molli
Federated query engines let data consumers query linked data across SPARQL endpoints. Replicating data fragments from different sources makes it possible to reorganize data to better fit the federated query processing needs of data consumers. However, existing federated query engines poorly support replication. In this paper, we propose a replication-aware federated query engine that extends the state-of-the-art federated query engines ANAPSID and FedX with Fedra, a source selection strategy that approximates the source selection problem with fragment replication (SSP-FR). For a given set of endpoints with replicated fragments and a SPARQL query, the problem is to find the endpoints to contact in order to minimize the number of tuples transferred from endpoints to the federated query engine. We devise the Fedra source selection algorithm, which approximates SSP-FR. We implement Fedra in FedX and ANAPSID and empirically evaluate their performance. Experimental results suggest that Fedra efficiently solves SSP-FR, reducing both the number of selected SPARQL endpoints and the size of intermediate query results.
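To make SSP-FR concrete, here is a deliberately simplified greedy sketch of replication-aware source selection. This is not the published Fedra algorithm, only an illustration of the underlying intuition that preferring endpoints which jointly cover many of the query's fragments reduces both the endpoints contacted and the tuples transferred; all names and data are hypothetical.

```python
# Simplified illustration of replication-aware source selection (not Fedra).
# Each fragment a query touches is replicated on several endpoints; greedily
# pick endpoints covering the most still-uncovered fragments (set cover).

def select_sources(fragments_needed, replicas):
    """fragments_needed: set of fragment ids the query touches.
    replicas: dict mapping endpoint -> set of fragment ids it hosts."""
    uncovered = set(fragments_needed)
    selected = {}
    while uncovered:
        endpoint = max(replicas, key=lambda e: len(replicas[e] & uncovered))
        gain = replicas[endpoint] & uncovered
        if not gain:
            raise ValueError("some fragments are not available at any endpoint")
        selected[endpoint] = gain
        uncovered -= gain
    return selected  # endpoint -> fragments to fetch from it

print(select_sources(
    {"f1", "f2", "f3"},
    {"e1": {"f1", "f2"}, "e2": {"f2"}, "e3": {"f1", "f3"}},
))  # {'e1': {'f1', 'f2'}, 'e3': {'f3'}}
```

The real Fedra strategy is integrated into FedX and ANAPSID and must also respect SPARQL query semantics, but this conveys the shape of the selection problem.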
In an era when data science is increasingly mainstream, whichever of data's several Vs you face, the most important is always the V for Value. Data mining is precisely the technique of systematically clarifying the structure of data and finding the valuable features and correlations within it. This six-hour course takes the most practical angle possible, sharing how to turn pressing real-world problems into problems that data-mining techniques can handle, and then uses R's many powerful tools for association analysis, regression analysis, and cluster analysis, with the ultimate goal of unearthing the information hidden in the data.
Opening Keynote for HadoopCon 2014
We are surrounded, both in daily life and online, by a flood of Big Data claims and technologies. The Hadoopers gathered here today are already Big Data stakeholders. Yet most of what we understand about Big Data comes from our own experience, and the Hadoop ecosystem is so sprawling that different use cases require different OSS projects to complete. Seen that way, which of us has ever taken in the whole Big Data landscape?
This talk shows how, through more windows onto that scenery, we can see the different worlds of Big Data more widely and more clearly.
This course takes those who are new to data analysis but full of interest through the complete process of using R: from initial data collection, through exploratory analysis to interpret the data, to text mining that uncovers meaning hidden beneath the data and invisible to the naked eye. It is designed mainly for people with a basic familiarity with R who want more hands-on practice with real analyses; by the end, you should be more at home with R as a rich analytical toolkit. Using a dataset of charitable donations from the Apple Daily, we will learn to dissect web pages from scratch and write crawlers to collect information automatically; once the data is in hand, to handle it flexibly through cleaning, integration, and exploration; and to apply off-the-shelf packages for text mining and document parsing. Step by step we will walk through a real data-analysis journey, processing, observing, and deconstructing the data, to see which factors actually influence people's donation decisions and how such findings are dug out of the data.
Exploratory data analysis is the process of quickly looking at data, formulating hypotheses, and testing those hypotheses. In practice, two of the most important components of this process are transforming data and visualizing it. This tutorial will be a hands-on, practical introduction to using R for data exploration, with an emphasis on data transformation and visualization. I will focus on using modern R packages like ggplot2, dplyr, and tidyr for this tutorial.
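Since this abstract collection spans both R and Python talks, here is the transform-then-visualize loop in a Python/pandas analogue rather than the tutorial's actual dplyr/ggplot2 code; the flights.csv file and its columns are made up for illustration.

```python
# EDA loop in miniature: load, transform, summarize, plot, repeat.
# (pandas/matplotlib analogue of the dplyr/ggplot2 workflow)
import pandas as pd
import matplotlib.pyplot as plt

flights = pd.read_csv("flights.csv")  # hypothetical dataset

# Transform: filter, derive a column, group and summarize
# (filter/mutate/group_by/summarise in pandas form).
delays = (
    flights[flights["dep_delay"].notna()]
    .assign(delayed=lambda d: d["dep_delay"] > 15)
    .groupby("carrier", as_index=False)
    .agg(flights=("delayed", "size"), share_delayed=("delayed", "mean"))
)

# Visualize: one quick plot per hypothesis, not one perfect figure.
delays.plot.bar(x="carrier", y="share_delayed", legend=False)
plt.ylabel("share of flights delayed > 15 min")
plt.tight_layout()
plt.show()
```

The point of the loop is speed: each transform-plot round either kills a hypothesis or suggests the next one.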
Automate Oracle database patches and upgrades using Fleet Provisioning and Patching, by Nelson Calero
Each new version of the Oracle database includes improvements in the upgrade and patching utilities, forcing us to update our procedures to incorporate these changes.
The Fleet Provisioning & Patching (FPP, formerly RHP) utility, together with the licensing change announced at OOW 2019 that makes it free to use with RAC, now makes it possible to manage the database software life cycle centrally.
This presentation shows examples of how to use FPP and different configuration options.
The PAC aims to promote engagement among experts from around the world and to create relevant, value-added content sharing between members. For Neotys, it also strengthens our position as a thought leader in load and performance testing.
Since its beginning, the PAC has been designed to connect performance experts in a single event. In June, over the course of 24 hours, 20 participants convened to explore several topics on the minds of today's performance testers, such as DevOps, shift left/right, test automation, blockchain, and artificial intelligence.
These slides are designed for people who already have some background in coding, but are new to the R language and environment.
This presentation took between 1.5 and 2 hours.
The slides were created with RMarkdown, so all the code shown here should run in R exactly as shown.
"RRR of Test Automation Maintenance" - Interactive Session by Ashwini Lalit at #ATAGTR2023, by Agile Testing Alliance
An interactive session by Ashwini Lalit, "RRR of Test Automation Maintenance", presented at #ATAGTR2023.
#ATAGTR2023 was the 8th Edition of Global Testing Retreat.
To know more about #ATAGTR2023, please visit: https://gtr.agiletestingalliance.org/
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges, Altinity Ltd
Slides for the Webinar, presented on March 6, 2019
For the webinar video visit https://www.altinity.com/
Extracting business insight from massive pools of machine-generated data is the central analytic problem of the digital era. ClickHouse data warehouse addresses it with sub-second SQL query response on petabyte-scale data sets. In this talk we'll discuss the features that make ClickHouse increasingly popular, show you how to install it, and teach you enough about how ClickHouse works so you can try it out on real problems of your own. We'll have cool demos (of course) and gladly answer your questions at the end.
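For readers who want to try the "install it and query it" part right away, here is a minimal round trip using the clickhouse-driver Python client; the table and column names are invented for illustration, and a local server on the default port is assumed.

```python
# Minimal ClickHouse round trip with clickhouse-driver
# (pip install clickhouse-driver); assumes a server on localhost:9000.
from datetime import date
from clickhouse_driver import Client

client = Client(host="localhost")

# MergeTree is ClickHouse's workhorse engine for fast analytic scans.
client.execute("""
    CREATE TABLE IF NOT EXISTS page_hits (
        event_date Date,
        url String,
        hits UInt32
    ) ENGINE = MergeTree ORDER BY (event_date, url)
""")

client.execute(
    "INSERT INTO page_hits (event_date, url, hits) VALUES",
    [(date(2019, 3, 6), "/home", 42), (date(2019, 3, 6), "/docs", 7)],
)

top = client.execute(
    "SELECT url, sum(hits) AS total FROM page_hits GROUP BY url ORDER BY total DESC"
)
print(top)  # [('/home', 42), ('/docs', 7)]
```

The same SQL statements work verbatim in the clickhouse-client CLI, which is the quickest way to follow along with the webinar demos.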
Speaker Bio:
Robert Hodges is CEO of Altinity, which offers enterprise support for ClickHouse. He has over three decades of experience in data management spanning 20 different DBMS types. ClickHouse is his current favorite. ;)
Doing Quality Assurance in PHP projects sometimes looks like a dark art! Picking the right tools, making all the tools work together, analysing your code, and still delivering all the required features of the software project can be quite challenging.
This talk aims to help lower the entry barrier for doing QA on your project, sharing the experience, knowledge, and tricks that bring QA back from the dark arts into the everyday work of a PHP programmer.
We will review tools like Jenkins, PHPUnit, phpcs, pdepend, and phpcpd, and how we can chain them together to make sure we are building great software.
Speed up your Machine Learning workflows with built-in algorithms - Tel Aviv..., by Amazon Web Services
In machine learning, training large models on massive amounts of data usually improves results. Our customers report, however, that training such models and deploying them is either operationally prohibitive or outright impossible for them. Amazon AI Algorithms is designed to solve this problem. It is a collection of distributed streaming ML algorithms that scale to any amount of data. They are fast and efficient because they distribute across CPU/GPU machines and share a collective distributed state via a highly optimized parameter server.
They scale to an infinite amount of data because they operate in the streaming model. This means they require only one pass over the data and never increase their resource consumption, allowing training to be paused, resumed, and snapshotted, and even letting algorithms consume Kinesis streams directly, providing an "always on" training mechanism. They are production ready: trained models are automatically containerized and usable in production using Amazon SageMaker hosting. Finally, we provide a convenient SDK which allows scientists to create new algorithms that operate in this model and enjoy all the benefits above.
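As a sketch of what that workflow looks like from the SDK side, the following trains a built-in algorithm and deploys it to a hosted endpoint. It is hedged in two ways: the role ARN, bucket, and instance types are placeholders, and the calls assume the modern (v2) SageMaker Python SDK rather than whichever version was current when this talk was given.

```python
# Hedged sketch: train a built-in SageMaker algorithm, then host it.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder ARN

# Resolve the container image for the built-in Linear Learner algorithm.
image = sagemaker.image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=2,  # streaming algorithms parallelize across machines
    instance_type="ml.c5.xlarge",
    output_path="s3://my-bucket/models/",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(predictor_type="binary_classifier",
                              mini_batch_size=200)

# One pass over the training data in S3; the model artifact
# lands in output_path, already containerized for hosting.
estimator.fit({"train": "s3://my-bucket/train/"})

# Deploy the trained model behind a real-time endpoint.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")
```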
A Taxonomy of Clustering, or, No Container is an Island, by Ted M. Young
Covers the evolution from static deployments using individual Docker containers, to dynamic deployments in Kubernetes and Mesos, with a taxonomy of clustering, i.e., what's important about cluster managers.
Talk given at the Software Development & Evolution Conference in Winnipeg, Manitoba, Canada on November 2nd, 2015.
No more Big Data Hacking—Time for a Complete ETL Solution with Oracle Data Integrator, by Jérôme Françoisse
How can you use ODI12c to generate your Hive, Pig, and Spark jobs? How can you orchestrate their execution directly on the Hadoop cluster? How do you get data into the Hadoop cluster, and how do you move it to an RDBMS?
Everything is answered in this session, presented at Oracle Open World 2015.
Malware Detection with OSSEC HIDS - OSSECCON 2014, by Santiago Bassett
My presentation on how to use malware indicators of compromise to create rootcheck signatures for OSSEC. Explains different malware collection and analysis techniques.
Implementation of a Big Data Architecture for Real-Time Analytics with DataStax Enterprise Graph, Analytics and Search, by Joseph Arriola
My topic presented at DataStax Accelerate 2019 was "Implementation of a Big Data architecture for real-time analytics with DataStax Enterprise Graph, Analytics and Search". It shows some of the most widely used open-source technologies on the market and how to integrate them with an enterprise tool to achieve real-time analytics.
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT..., by Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. However, it comes with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads and is expected to be a non-issue when the computation is performed on massive graphs.
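A compact way to see the decomposition described above is through the condensation of a digraph into its DAG of strongly connected components. The sketch below is an illustrative single-machine rendering of the levelwise idea using networkx, not the report's CPU/GPU implementation, and it assumes the dead-end-free precondition the report states.

```python
# Illustrative levelwise PageRank (not the report's implementation):
# condense the graph into its DAG of strongly connected components and
# rank one component at a time in topological order, so rank flowing
# into a component is already final when that component is processed.
import networkx as nx

def levelwise_pagerank(G, d=0.85, tol=1e-10, max_iter=100):
    assert all(deg > 0 for _, deg in G.out_degree()), "precondition: no dead ends"
    n = G.number_of_nodes()
    rank = {v: 1.0 / n for v in G}
    C = nx.condensation(G)               # SCC DAG; C.nodes[c]["members"]
    for c in nx.topological_sort(C):     # components, one level at a time
        comp = C.nodes[c]["members"]
        for _ in range(max_iter):
            new = {}
            for v in comp:
                # In-component contributions still change each iteration;
                # contributions from earlier components are already final.
                s = sum(rank[u] / G.out_degree(u) for u in G.predecessors(v))
                new[v] = (1.0 - d) / n + d * s
            delta = sum(abs(new[v] - rank[v]) for v in comp)
            rank.update(new)
            if delta < tol:
                break
    return rank

G = nx.DiGraph([(1, 2), (2, 1), (2, 3), (3, 3)])  # tiny graph, no dead ends
print(levelwise_pagerank(G))
```

Only the vertices of the current component are touched per iteration, which is exactly why no per-iteration communication across components is needed in the distributed setting.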
Explore our comprehensive data-analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures the stolen assets are flagged as fraudulent, making it far harder for the thief to move or cash them out.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Opendatabay - Open Data Marketplace, by Opendatabay
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It also leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies data acquisition with an intuitive interface and robust search tools, letting you effortlessly explore, discover, and access the data you need so you can focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By combining distributed ledger technology with rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Adjusting primitives for graph: SHORT REPORT / NOTES, by Subhajit Sahu
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation. The experiments compare the following:
Multiply with different modes (map):
1. Performance of sequential vs. OpenMP-based vector multiply.
2. Comparison of various launch configs for CUDA-based vector multiply.

Sum with different storage types (reduce):
1. Performance of vector element sum using float vs. bfloat16 as the storage type.

Sum with different modes (reduce):
1. Performance of sequential vs. OpenMP-based vector element sum.
2. Performance of memcpy-based vs. in-place CUDA vector element sum.
3. Comparison of various launch configs for CUDA-based vector element sum (memcpy).
4. Comparison of various launch configs for CUDA-based vector element sum (in-place).

Sum with in-place strategies in CUDA mode (reduce):
1. Comparison of various launch configs for CUDA-based vector element sum (in-place).
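Since the notes above are terse, here is a small Python/NumPy analogue of the "modes" and "storage types" comparisons; it is only an analogue of the C++/OpenMP/CUDA benchmarks, not the code behind them, and float32 stands in for bfloat16 because NumPy lacks a native bfloat16 type.

```python
# Analogue of the experiments above (not the original C++/CUDA code):
# contrast reduction "modes" and storage precisions for a vector sum.
import time
import numpy as np

x64 = np.random.rand(10_000_000)   # float64 reference data
x32 = x64.astype(np.float32)       # lower-precision storage type

def timed(label, fn):
    t0 = time.perf_counter()
    result = fn()
    print(f"{label:24s} {time.perf_counter() - t0:8.4f}s  sum={result:.4f}")

# Mode 1: sequential loop (stand-in for the sequential baseline).
timed("sequential loop", lambda: sum(float(v) for v in x64[:1_000_000]))

# Mode 2: vectorized reduction (stand-in for the OpenMP/CUDA versions).
timed("float64 vector reduce", lambda: x64.sum())
timed("float32 vector reduce", lambda: float(x32.sum()))

# Storage-type effect: lower precision accumulates more rounding error,
# the trade-off the bfloat16-vs-float comparison in the notes probes.
print("float64 vs float32 sum gap:", abs(x64.sum() - float(x32.sum())))
```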
[Slide 19/80, "Airbnb uses R to scale data science": Rbnb; Airbnb's engineering, data science, analytics and user experience teams; Hadoop / SQL; R; missing data; 500+; "How Airbnb uses Machine Learning to Detect Host Preferences"; "How well does NPS predict rebooking?"]
[Slide 80/80, "R vs Python": "Choosing R or Python for data analysis? An infographic"; "Pros and Cons of R vs Python"; scikit-learn; "Which is better for data analysis: R or Python?"; "How to Choose Between Learning Python or R First"]