Presented at the ACEMS workshop at QUT in February 2015.
Credits: whole project team (names listed in the first slide).
Approved by CSIRO to be shared externally.
Presentation: Study: #Big Data in #Austria, Mario Meir-Huber, Big Data Leader Eastern Europe, Teradata GmbH & Martin Köhler, Austrian Institute of Technology, AIT (AT), at the European Data Economy Workshop taking place back to back to SEMANTiCS2015 on 15 September 2015 in Vienna.
Data science and visualization lab presentation – iHub Research
The Data Science and Visualization Lab! This product is based on a component of research that delves into and innovates on the processes of data science – collection, storage/management, analysis and visualization. You have probably come across one of our amazing infographics. What else can you do with data?
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems – Ganesan Narayanasamy
As the adoption of AI technologies increases and matures, the focus will shift from exploration to time to market, productivity and integration with existing workflows. Governing enterprise data, scaling AI model development, and selecting a complete, collaborative hybrid platform and tools for rapid solution deployment are key focus areas for growing data scientist teams tasked to respond to business challenges. This talk will cover the challenges and innovations for AI at scale in industries such as healthcare and automotive, the AI ladder, the AI life cycle, and infrastructure architecture considerations.
FUTURE OF DATA SCIENCE IN INDIA
DATA SCIENCE
It is a tool that uses all kinds of data, algorithms and scientific methods. It is very important because it combines two of the most important things in technology and modern science: mathematics and computer science. Organizing, delivering and packaging data are the three most important components involved in data science. Data science handles data, works on it and draws conclusions based on it.
What is Big Data? What is Data Science? What are the benefits? How will they evolve in my organisation?
Built around the premise that the investment in big data is far less than the cost of not having it, this presentation, made at a tech media industry event, unveils and explores the nuances of Big Data and Data Science and their synergy forming Big Data Science. It highlights the benefits of investing in it and defines a path to their evolution within most organisations.
Predictive Analytics - Big Data & Artificial Intelligence – Manish Jain
A quick overview of the latest in big data and artificial intelligence. A lot of buzzwords are being thrown around; hopefully this presentation will demystify many of the terms.
A Seminar Presentation on Big Data for Students.
Big data refers to a process that is used when traditional data mining and handling techniques cannot uncover the insights and meaning of the underlying data. Data that is unstructured or time sensitive or simply very large cannot be processed by relational database engines. This type of data requires a different processing approach called big data, which uses massive parallelism on readily-available hardware.
Q: Can I simply hire one rockstar data scientist to cover all this kind of work?
A: No, interdisciplinary work requires teams
A: Hire leads who can speak the lingo of each required discipline
A: Hire individual contributors who cover 2+ roles, when possible
Statistical Thinking – Solve the Whole Problem
BONUS: Meta Organization – Integration with Adjacent Teams
Co-authors Allen Day @allenday and Paco Nathan @pacoid
Presentation: Data Activities in Austria, Lisbeth Mosnik, BMVIT (AT), at the European Data Economy Workshop taking place back to back to SEMANTiCS2015 on 15 September 2015 in Vienna.
Big data is a huge volume of heterogeneous data, often generated at high speed. It cannot be handled with traditional data analytics tools. Hadoop is one of the most widely used big data analytics tools; MapReduce, Hive and HBase are also tools for analysis of big data.
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI – Big Data Week
Charles Cai has more than two decades of experience and a track record of global transformational programme deliveries – from vision and evangelism to end-to-end execution in global investment banks and energy trading companies – where he excels at designing and building innovative, large-scale Big Data systems in high-volume, low-latency trading, global Energy Trading & Risk Management, and advanced temporal and geospatial predictive analytics, as Chief Front Office Technical Architect and Head of Data Science. He is also a frequent speaker at Google Campus, Big Data Innovation Summit, Cloud World Forum, Data Science London, QCon London and the MoD CIO Symposium, promoting knowledge and best-practice sharing with audiences ranging from developers and data scientists to CXO-level senior executives from both IT and business backgrounds. He has in-depth knowledge of and experience with the Scala, Python, C# / F#, C++, Node.js, Java, R and Haskell programming languages across Mobile, Desktop, Hadoop/Spark, Cloud, IoT/MCU and Blockchain, and holds TOGAF9, EMC-DS and AWS CNE4 certifications.
An introductory but highly practical talk on starting a Data Science career and life. It touches upon all the main aspects of the path towards becoming a Data scientist, also seen through a personal development perspective. Moreover, we talk about the role that a data scientist ultimately fulfills - as an individual or as a team - in the technology innovation life cycle and the product life-cycle.
Top 8 Data Science Tools | Open Source Tools for Data Scientists | Edureka – Edureka!
** Machine Learning Engineer Masters Program: https://www.edureka.co/masters-program/machine-learning-engineer-training **
This Edureka Session on Data Science Tools will help you understand the best tools to get you started with Data Science. Here’s a list of topics that are covered in this session:
Introduction To Data Science
Data Science Tools
Data Science Tools For Data Storage
Data Science Tools For Data Manipulation
Data Science Tools For EDA
Data Science Tools For Data Visualization
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
This is the presentation of my mini viva talk given to examiners who assess my PhD's 1st year following the probationary report. It is a summary of my research aims, what I have been doing since the beginning of my 1st year and my plans for the following years of the PhD
This slideshow provides an overview of best practices for visual analysis within Tableau. It is intended for anyone who wants to tell more compelling stories with their data.
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,... – Mihai Criveti
Automate your Data Science pipeline with Ansible, Python and Kubernetes - ODSC Talk
What is Data Science and the Data Science Landscape
Process and Flow
Understanding Data
The Data Science Toolkit
The Big Data Challenge
Cloud Computing Solutions
The rise of DevOps in Data Science
Automate your data pipeline with Ansible
Data Science at Scale - The DevOps Approach – Mihai Criveti
DevOps Practices for Data Scientists and Engineers
1 Data Science Landscape
2 Process and Flow
3 The Data
4 Data Science Toolkit
5 Cloud Computing Solutions
6 The rise of DevOps
7 Reusable Assets and Practices
8 Skills Development
A technical Introduction to Big Data Analytics – Pethuru Raj PhD
This presentation gives the details about the sources for big data, the value of big data, what to do with big data, the platforms, the infrastructures and the architectures for big data analytics
Developed by Google’s Artificial Intelligence division, the Sycamore quantum processor boasts 53 qubits.
In 2019, it achieved a feat that would take a state-of-the-art supercomputer 10,000 years to accomplish: completing a specific task in just 200 seconds.
2023 GEOINT Tutorial - Synthetic Data Tools for Computer Vision-Based AI - Re... – Chris Andrews
The acquisition of labeled, unbiased, high quality remote sensing information for training AI systems is expensive, error prone, and sometimes impossible or dangerous. The efficacy of Remote Sensing and Imagery Analysis tools that use AI depends directly on the data used for training and validation, meaning that the cost and availability of data limits the application of AI for imagery exploitation. Synthetic Computer Vision (CV) data has become a strategy to reduce the cost and limitations of using real-world data in detection problems in data sparse domains. Focusing on remote sensing data including visible and invisible electromagnetic spectra, attendees will learn about the expanding options for generating synthetic data that are being used in commercial and academic domains, the technology options available for users who want to create CV content of a variety of types, and patterns of creating synthetic data to support
Learning Objectives
- Describe synthetic data including different types such as Generative AI and physics-based data
- Identify the opportunities for applying synthetic data in place of real sensor data
- Describe the steps required to generate synthetic data for computer vision workflows, from concept to production, for training and validating AI.
- The intent of this class is to introduce the concepts and mechanisms behind the creation of synthetic data and to expose students to approaches for generating synthetic data using tools currently on the market.
- Familiarity with concepts around AI training and validation using remotely sensed data will be helpful for attendees.
Advanced Analytics and Machine Learning with Data Virtualization – Denodo
Watch: https://bit.ly/2DYsUhD
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python and Scala, put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Attend this webinar and learn:
- How data virtualization can accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- How popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc. integrate with Denodo
- How you can use the Denodo Platform with large data volumes in an efficient way
- How Prologis accelerated their use of Machine Learning with data virtualization
In this slidedeck, Infochimps Director of Product, Tim Gasper, discusses how Infochimps tackles business problems for customers by deploying a comprehensive Big Data infrastructure in days; sometimes in just hours. Tim unlocks how Infochimps is now taking that same aggressive approach to deliver faster time to value by helping customers develop analytic applications with impeccable speed.
IBM's Watson is a machine-learning platform that’s been built to mirror the same learning process that humans have: Observe, Interpret, Evaluate and Decide. Through the use of this cognitive framework, Watson can search through a database of information and pull out key insights to bridge gaps in human knowledge. It’s expertise scaling for enterprise.
Watson has already helped businesses across a variety of industries increase their customer engagement, data discovery and informed decision making abilities. Is your business next?
Bridging the Gap: Analyzing Data in and Below the Cloud – Inside Analysis
The Briefing Room with Dean Abbott and Tableau Software
Live Webcast July 23, 2013
http://www.insideanalysis.com
Today’s desire for analytics extends well beyond the traditional domain of Business Intelligence. That’s partly because business users are realizing the value of mixing and matching all kinds of data, from all kinds of sources. One emerging market driver is Cloud-based data, and the desire companies have to analyze this data cohesively with their on-premise data sets.
Register for this episode of The Briefing Room to learn from Analyst Dean Abbott, who will explain how the ability to access data in the cloud can play a critical role for generating business value from analytics. He’ll be briefed by Ellie Fields of Tableau Software who will tout Tableau’s latest release, which includes native connectors to cloud-based applications like Salesforce.com, Amazon Redshift, Google Analytics and BigQuery. She’ll also demonstrate how Tableau can combine cloud data with other data sources, including spreadsheets, databases, cubes and even Big Data.
A changing market landscape and open source innovations are having a dramatic impact on the consumability and ease of use of data science tools. Join this session to learn about the impact these trends and changes will have on the future of data science. If you are a data scientist, or if your organization relies on cutting edge analytics, you won't want to miss this!
The global need to securely derive (instant) insights, have motivated data architectures from distributed storage, to data lakes, data warehouses and lake-houses. In this talk we describe Tag.bio, a next generation data mesh platform that embeds vital elements such as domain centricity/ownership, Data as Products, Self-serve architecture, with a federated computational layer. Tag.bio data products combine data sets, smart APIs, statistical and machine learning algorithms into decentralized data products for users to discover insights using FAIR Principles. Researchers can use its point and click (no-code) system to instantly perform analysis and share versioned, reproducible results. The platform combines a dynamic cohort builder with analysis protocols and applications (low-code) to drive complex analysis workflows. Applications within data products are fully customizable via R and Python plugins (pro-code), and the platform supports notebook based developer environments with individual workspaces.
Join us for a talk/demo session on Tag.bio data mesh platform and learn how major pharma industries and university health systems are using this technology to promote value based healthcare, precision healthcare, find cures for disease, and promote collaboration (without explicitly moving data around). The talk also outlines Tag.bio secure data exchange features for real world evidence datasets, privacy centric data products (confidential computing) as well as integration with cloud services
Roland Haeve (Atos): 'Using the Cloud for Big Data Analytics' – AlmereDataCapital
Presentation by Roland Haeve (Atos): 'Using the Cloud for Big Data Analytics' at the Big Data Analytics seminar held on 14 June by Almere DataCapital in Almere.
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at... – Yael Garten
2017 StrataHadoop SJC conference talk. https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56047
Description:
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #DataScienceHappiness.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... – Shirshanka Das
Similar to Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. February 2015
Expanded Perception and Interaction Centre (EPICentre) – Tomasz Bednarz
Expanded Perception and Interaction Centre (EPICentre) is a pioneering high-performance visualisation facility. It forges new ground in integrated thinking (artistic and scientific) to facilitate understanding of complex datasets and ultra-scale imagery. EPICentre promotes cross connection of visualization with applied computational simulations, artificial intelligence (AI), and creativity in arts and science.
"SoS" recreates the experiences of two Syrian asylum seekers as they lose sight of each other during a treacherous ocean voyage from Indonesia to Northern Australia.
Demoscene (Underground Real-Time Art) was born in the computer underground, and demos are the product of extreme programming and self-expression (see for example http://youtu.be/UmS6LtNwMcE). Many demoscene productions are inspired by real science, presented in very creative ways – visuals synchronised with the music to achieve maximum awesomeness, but also sending a strong message to the viewer. Come and listen to stories about connecting design, art and science together, and also about some coding tricks.
Open presentation, training material. Presented at CSIRO Big Data 2.0 workshop in September 2013, North Ryde, Australia. Animated by hands-on examples.
This PDF is about schizophrenia.
For more details visit the SELF-EXPLANATORY channel on YouTube:
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Thanks...!
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ... – Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4–0.9 µm) and novel JWST images with 14 filters spanning 0.8–5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at >2.3 µm to construct an ultradeep image, reaching as deep as ≈31.4 AB mag in the stack and 30.3–31.0 AB mag (5σ, r = 0.1″ circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5–15. These objects show compact half-light radii of R1/2 ∼ 50–200 pc, stellar masses of M⋆ ∼ 10^7–10^8 M⊙, and star-formation rates of SFR ∼ 0.1–1 M⊙ yr^−1. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
What are greenhouse gases, and how many gases affect the Earth? – moosaasad1975
What are greenhouse gases, how do they affect the Earth and its environment, what is the future of the environment and the Earth, and how do the weather and the climate have an effect?
Multi-source connectivity as the driver of solar wind variability in the heli... – Sérgio Sacani
The ambient solar wind that fills the heliosphere originates from multiple sources in the solar corona and is highly structured. It is often described as high-speed, relatively homogeneous, plasma streams from coronal holes and slow-speed, highly variable, streams whose source regions are under debate. A key goal of ESA/NASA’s Solar Orbiter mission is to identify solar wind sources and understand what drives the complexity seen in the heliosphere. By combining magnetic field modelling and spectroscopic techniques with high-resolution observations and measurements, we show that the solar wind variability detected in situ by Solar Orbiter in March 2022 is driven by spatio-temporal changes in the magnetic connectivity to multiple sources in the solar atmosphere. The magnetic field footpoints connected to the spacecraft moved from the boundaries of a coronal hole to one active region (12961) and then across to another region (12957). This is reflected in the in situ measurements, which show the transition from fast to highly Alfvénic then to slow solar wind that is disrupted by the arrival of a coronal mass ejection. Our results describe solar wind variability at 0.5 au but are applicable to near-Earth observatories.
(May 29th, 2024) Advancements in Intravital Microscopy - Insights for Preclini... – Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for ultra-fast, high-resolution imaging of cellular processes over time and space, studied in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enable researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN – Sérgio Sacani
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Platform for Big Data Analytics and Visual Analytics: CSIRO use cases. February 2015
1. CSIRO DIGITAL PRODUCTIVITY FLAGSHIP
Platform for Big Data Analytics and Visual Analytics:
CSIRO Use Cases
Tomasz Bednarz | Research Team Leader
23rd February 2015 | Statistical Modelling and Analysis of Big Data Workshop 2015
The ARC Centre of Excellence in Mathematical and Statistical Frontiers in Big Data, Big Models and New Insights
Project Team: Piotr Szul, Yulia Arzhaeva, Luke Domanski, Ryan Lagerstrom, Surya Nepal, John Zic, John Taylor
2. Platform for Big Data Analytics and Visual Analytics
CSIRO Computational Simulation Sciences TCP project, Digital Productivity
Flagship Platform for Big Data Analytics and Visual Analytics
Dual use of Platform:
• Support and foster a community around Big Data processing and visualisation
• Provide computing tools and services supporting CSIRO specific Big Data
Analytics needs
What will the tools be:
• Facility (software + hardware)
• Portable VM or container image (run everywhere)
Platform for Big Data Analytics and Visual Analytics
3. Platform for Big Data Analytics and Visual Analytics
Definition
Platform for Big Data Analytics and Visual Analytics
The Platform is a software solution stack (on hardware infrastructure) that supports development of big data analytics and visual analytics applications.
It is:
• Scalable: given appropriate hardware, it can scale to petabytes of data and thousands of nodes.
• Universal: it can be deployed on a variety of computational platforms (clouds, HPC clusters, dedicated clusters) and can use GPGPUs transparently.
• Integrated: it is integrated with relevant CSIRO systems (e.g. Digital Access Portal, Bowen Clouds).
4. Isn’t Big Data a solved problem?
Can’t we just install the most popular software and be done with it?
No….for CSIRO, it is more complex
Science vs Commercial has a different set of needs
CSIRO = many disciplines/applications = different tool requirements
CSIRO = diverse large scale storage facilities, discipline specific/optimised data cubes, HPC parallel storage systems
CSIRO = diverse set of compute infrastructure
Platform for Big Data Analytics and Visual Analytics
Platform for Big Data Analytics and Visual Analytics
Why?
5. What does Big Data Analytics mean to Science?
Big data software survey and analysis
R Big data package survey and analysis
Conceptual Platform Design
Planning layered architecture
– Big picture view: available software, CSIRO Infrastructure + Science
Plan of attack
Assessment of user requirements
User and project group outreach
Workshop Questionnaires and Abstracts
Platform for Big Data Analytics and Visual Analytics
What have we been doing?
Understanding
6. Understand
Big Data Analytics in Science?
Scientist & CSIRO specific needs
Tools and software landscape
Big Picture Design
Forest from the trees
Layering: General to Specific, extensible, clear boundaries/responsibility/interfaces
Portable & Interoperable: share nothing/minimum, technology adapters, diverse infrastructure, diverse applications, extensible
Refine Design + Implementation (Plan of attack)
Driven by Real business/use cases
Platform for Big Data Analytics and Visual Analytics
Goals + Progress
Tools to empower scientists
7. Platform for Big Data Analytics and Visual Analytics
What is “Big Data” processing?
“In machine learning, Python is like the jazz movement and R is like classical music.”
8. Definition:
A collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications (Wikipedia)
Simple, right? But the private sector has the loudest Big Data voice.
Most popular tools and resources lean heavily towards:
Unstructured data, high number of small loosely related data elements
Hadoop, HDFS, NoSQL, Hadoop, HDFS, NoSQL, Hadoop, HDFS… etc.
Platform for Big Data Analytics and Visual Analytics
Big Data definition vs discussion?
Understand
9. Some science problems fit the commercial mold. Many don’t, e.g.:
Highly regular and structured data samples
Single large datasets of tightly coupled samples
Streaming data from sensors
Getting data from domain specific data cubes
The right tools do exist, just not as visible in the community
Which ones do we need?
How do we integrate them with popular tools?
Can we still use commercially driven tools for science problems that break the mold?
Platform for Big Data Analytics and Visual Analytics
Big Data: where does science fit?
Understand
10. Definition:
The discovery and communication of meaningful patterns in data (Wikipedia)
Wow, that’s broad! But the commercial world has the loudest voice again:
Analytics = [predictions] used to recommend action or to guide decision making rooted in business context (Wikipedia)
Fortunately, this requires tools commonly used in science as well: data modeling, machine learning, optimization algorithms, visualisation, etc.
Platform for Big Data Analytics and Visual Analytics
Analytics definition vs discussion?
Understand
11. Who is REALLY doing Big Data? What are their needs?
Application/tools
– Linear Algebra? Machine Learning? Image Processing? Text/pattern matching/mining?
Data
– Streaming vs Persistent+(Static||Dynamic)? Unstructured vs Structured? SQL vs NoSQL vs Text vs Binary?
Human Workflow
– Prototype vs Production, Exploratory vs Directed, Interactive vs Batch
– Scale code+tools from Interactive+Prototype => Production+Batch
What will they need to work on? How much can we support!?
CSIRO infrastructure: Storage + Compute
– Where is (should be) the data? Don’t move it!!
– What/Where is the compute?
Possible?? Transparency + Interoperability + Portability over Infrastructure
– HPC + Internal Cloud + Dedicated System
Platform for Big Data Analytics and Visual Analytics
The punters want this!
Scientist and CSIRO specific needs
Understand
12. What is out there?
What delivers our scientists’ requirements?
Does it support CSIRO infrastructure?
How does it all fit together?
Inter layer: Does product X work with product Y
Intra layer: Can data stored by A be easily abstracted/ingested by B
Platform for Big Data Analytics and Visual Analytics
Tools and Software Landscape
Understand
13. Data A, B, C + Infrastructure 1, 2, 4 + Tool/Software α, β, γ + Science App/Domain l, m, n
How to deal with Complexity!!!
1. Define the Forest
2. Map the Trees to Forest
3. Pick which Trees to keep/use
Platform for Big Data Analytics and Visual Analytics
Seeing the Forest from the Trees
Design
14. Platform for Big Data Analytics and Visual Analytics
Seeing the Forest from the Trees
Design
15. Big Data: Petabytes
– Storage of low-value data
– H/W failure common
– Code: frequency, graphs, machine-learning, rendering
– Ingress/egress problems
– Dense storage of data
– Mix CPU and data
– Spindle:core ratio
HPC: Petaflops
– Storage for checkpointing
– Surprised by H/W failure
– Code: simulation, rendering
– Less persistent data, ingress & egress
– Dense compute
– CPU + GPU
– Bandwidth to other servers
• Failure is inevitable → fault tolerance built in
• Bandwidth and IO are precious → topology-aware scheduling
• Linear scalability → massive parallelisation, minimal communication
• Hide the complexities from developers → expressive programming model
Platform for Big Data Analytics and Visual Analytics
Big Data versus HPC
Understand
16. Platform for Big Data Analytics and Visual Analytics
• Become a Big Data Excellence Centre with the vision/mission to be a hub for big data analytics and processing technology and to provide technical expertise in this area.
• Achieve a step change in the size of big data problems that are being tackled in CSIRO.
• Decrease the effort and time required for CSIRO to discover new patterns in massive datasets.
• Simplify scientists’ workflows with big data sets.
• Develop solution architectures and software components to support specific needs of big data processing and visualisation in CSIRO.
• Deliver a CSIRO shared "big data facility" supporting integration and processing of data from different data sources. That would be more of an infrastructure project, built together with IM&T (Bowen Clouds) for certain types of in-house big data processing scenarios.
Platform for Big Data Analytics and Visual Analytics
Vision
17. • Connect data analytics, simulations, statistical modeling, image & video analytics, machine learning, and visualisation into one stack of reusable solutions supporting various science domains.
• Build more interactive solutions that connect users with analytical models to improve business decisions.
• Create new business cases.
Platform for Big Data Analytics and Visual Analytics
Platform for Big Data Analytics and Visual Analytics
Mission
18. • Uptake of the technology in
CSIRO, transforming the way we
do science.
• Contribution to Big Data
Science globally.
• International collaborations.
• Enable new discoveries.
• Reduce time to new discovery.
• Global outreach.
• External grants, engagements
with industry.
Platform for Big Data Analytics and Visual Analytics
Platform for Big Data Analytics and Visual Analytics
Success factors
19. • Data discovery
• Quantitative visualisation focus:
• Measurement on visualisation
• Uncertainty - from data to display
• Integration
• Interaction
• Views of the data
• Collaboration across virtual
environments
• Annotated 3D videos
• Augmented Reality
• Immersive Virtual Reality
• Wearables + Visual Analytics
Platform for Big Data Analytics and Visual Analytics
Platform for Big Data Analytics and Visual Analytics
Visual Analytics
RAVE @ NIST/USA
20. Platform for Big Data and Visual Analytics
Our project is oriented towards providing incremental, use-case driven development of technical capabilities, including skills, software and infrastructure, to facilitate scientists’ access to big data processing.
Come talk to us!
https://wiki.csiro.au/display/bigdata/PBDAVA+Collaboration
Platform for Big Data Analytics and Visual Analytics
21. Funded from CAPEX & built in collaboration with IM&T
Deployed on Bowen Cloud
16 nodes, each with:
128GB RAM and 16 CPU cores
Infiniband network
~100 TB of storage (planned)
Various storage options being considered: OSM/NFS, HDFS, GPFS+FPO
YARN cluster (CDH5): Hadoop MR, Spark, h2o … (any YARN-compatible framework)
Status: storage testing
For more see:
https://wiki.csiro.au/display/ICTCRC/DP+Research+Big+Data+Cluster
The DP Research Big Data Cluster is a dedicated hardware cluster intended both to support big data related computer science research and to provide experimental big data processing capabilities for scientific projects within DP.
Platform for Big Data Analytics and Visual Analytics
DP Big Data Cluster
22. DP Big Data Cluster – Architecture
[Architecture diagram: Edge Node (clients, compiler, staging, monitor); Master nodes (YARN Master, HDFS Master); Worker nodes (YARN Worker, HDFS Worker) on hosts hadoop1-01-cdc, hadoop1-02-cdc, hadoop1-{03..16}-cdc; Infiniband network; storage via OSM/NFS (Bowen Storage) and GPFS DAS; Nexus authentication; Ganglia monitoring; access from CSIRO Intranet workstations; Bowen Compute alongside Bragg and Pearcey.]
Platform for Big Data Analytics and Visual Analytics
23. Hadoop
What is it?
Platform for Big Data Analytics and Visual Analytics
● Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
● Designed to scale up from single servers to thousands of machines, each offering local computation and storage.
● Designed to detect and handle failures at the application layer.
http://hadoop.apache.org
24. Hadoop
Components
Platform for Big Data Analytics and Visual Analytics
● Hadoop components:
● Hadoop Distributed File System (HDFS)
● MapReduce
●Handles any data type
● Structured
● Unstructured
● Schema
● No schema
● High volume
● Low volume
25. Hadoop
Hadoop Distributed File System
Platform for Big Data Analytics and Visual Analytics
● Breaks incoming files into blocks and stores them redundantly across the cluster
● A single large file is split into blocks, and the blocks are distributed among the nodes
● Blocks in HDFS are large – typically 128MB in size
● Files in HDFS are ‘write once’ (no random writes allowed) and processed by the MR framework. Results are stored back in HDFS.
● The original data file is not modified during its lifecycle
26. Hadoop
HDFS
Platform for Big Data Analytics and Visual Analytics
● Data replication (to enhance reliability and availability) – the default is threefold
● HDFS is optimised for large, streaming reads of files (rather than random reads)
● A master node, the NameNode, keeps track (metadata) of the blocks that make up a file and their locations
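A back-of-envelope sketch in Python (with assumed example numbers) of how the block size and replication factor above translate into block counts and raw storage:

import math

file_size_mb = 1000     # an assumed 1 GB input file
block_size_mb = 128     # typical HDFS block size, as above
replication = 3         # default threefold replication, as above

num_blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication

print(num_blocks)       # 8 blocks: seven full 128 MB blocks plus one 104 MB block
print(raw_storage_mb)   # 3000 MB of raw cluster storage for 1000 MB of data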
27. Hadoop
Example
Platform for Big Data Analytics and Visual Analytics
● NameNode holds metadata for files
● DataNodes hold the actual blocks
28. MapReduce
Word count example
Platform for Big Data Analytics and Visual Analytics
Map: reads each line in the text one at a time, splits out each word into a separate string, and for each word outputs the word and a 1 to indicate it has been seen one time.
Shuffle: uses the word as the key, hashing the records to reducers.
Reduce: sums up the number of times each word was seen and writes that together with the word as output.
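A toy, single-process Python sketch of the same map/shuffle/reduce flow (illustrative only – on a real cluster Hadoop distributes each phase across nodes):

from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit (word, 1) for every word on every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted pairs by key (word), as the framework does
# when hashing records to reducers.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the counts for each word and emit (word, count).
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}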
29. Big Volume Processing
Architectures
– Share nothing
– Traditional: compute + storage
Parallel file systems
– HDFS, GPFS + FPO,
– S3, Swift, Lustre, Gluster
Processing
– Out of core (MapReduce)
– In memory
Scheduling:
– Yarn, Mesos
MapReduce: a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster, plus a parallel filesystem.
[Diagram: a spectrum of programming models – MapReduce, DAG, Graph and BSP/Collective models, increasingly iterative – with example frameworks including Hadoop, Twister, HaLoop, MPI, Dryad, Spark, Giraph, Hama, GraphLab, GraphX, Harp, Stratosphere and Reef.]
Platform for Big Data Analytics and Visual Analytics
30. Pig
Philosophy
● Pigs eat anything
○ Input data can come in any format – popular formats, such as tab-delimited, are natively supported. Users can add functions to support other data formats.
○ Operates on data: relational, nested, semi-structured, or unstructured
● Pigs live anywhere
● Pigs are domestic animals
● Pigs fly
○ Pig processes data quickly.
Platform for Big Data Analytics and Visual Analytics
31. Pig
What is it?
● Pig provides an engine for executing data flows in parallel on Hadoop
● Pig includes a language called Pig Latin for expressing data flows
● Pig Latin includes operators for many of the traditional data operations (so they need not be re-invented, as in raw Hadoop): JOIN, SORT, FILTER, FOREACH, GROUP, LOAD and STORE.
● Pig makes use of the Hadoop Distributed File System (HDFS) and the MapReduce processing system
Why?
Faster development (increases productivity 10x), flexible, express data transformation tasks in just a few lines of code
Don’t reinvent the wheel, 10 lines of Pig Latin = ~200 lines of Java
Platform for Big Data Analytics and Visual Analytics
32. Pig
Workflow
● A LOAD statement reads data from the file system.
● A series of transformation statements process the data.
● A STORE statement writes output to the file system, or a DUMP statement displays output to the screen.
● Pig first validates the syntax and semantics of all statements and executes them only when it encounters a DUMP or STORE statement.
Platform for Big Data Analytics and Visual Analytics
34. Pig
Pig Latin
● Pig Latin is a dataflow language --> it allows users to describe how data from one or more inputs should be read, processed and stored to one or more outputs in parallel.
● Data flows can be:
○ Linear: as in the word count example
○ Complex: multiple inputs are joined, and data is split into multiple streams to be processed by different operators
● A Pig Latin script describes a directed acyclic graph (DAG) where the edges are data flows and the nodes are operators that process the data
● Pig Latin has no if statements or for loops (= it focuses on data flow)
○ Traditional procedural and OO programming languages describe control flow; data flow is a side effect of the program.
Platform for Big Data Analytics and Visual Analytics
35. Pig
Running Pig / Starting Grunt
Platform for Big Data Analytics and Visual Analytics
● Pig supports local mode: useful for prototyping and debugging Pig Latin scripts. Test on small data and move to large data.
● Pig also runs in mapreduce mode: it does parsing, checking and planning locally, but executes MapReduce jobs on the Hadoop cluster (it needs to know where the NameNode and JobTracker are located).
You can execute Pig Latin statements:
● Using the command line / Grunt shell
● In local mode or mapreduce mode (to interact with HDFS on your cluster)
● Either interactively or in batch
● Embedded Pig
37. Pig
Schemas
● Pig eats everything - lax attitude for schemas
● If schema for data is available, Pig will use it
● If schema for data is not available, Pig will process the data and will
make the best guesses (on how script treats data)
40. Pig
User Defined Functions (UDF)
Platform for Big Data Analytics and Visual Analytics
● Benefits
○ Use legacy code
○ Use library in scripting language
○ Leverage Hadoop for non-Java programmers
● Extensible Interface
○ Minimum effort to support another language
● Currently supported languages
○ Python
○ JavaScript
○ Ruby
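For illustration, a minimal sketch of a Python UDF for Pig (hypothetical file and function names; assumes Pig's Python/Jython UDF support, where the script is registered from Pig Latin and an outputSchema decorator tells Pig the return schema):

# Hypothetical example file: string_udfs.py
# From Pig Latin it would be registered along the lines of
#   REGISTER 'string_udfs.py' USING jython AS udfs;
# and then called as udfs.normalize(field).
try:
    from pig_util import outputSchema   # made available when run as a Pig UDF
except ImportError:
    def outputSchema(schema):           # no-op fallback so the file also runs as plain Python
        def wrap(func):
            return func
        return wrap

@outputSchema('word:chararray')
def normalize(word):
    # Trim and lower-case a word before it is grouped or joined.
    if word is None:
        return None
    return word.strip().lower()

if __name__ == '__main__':
    print(normalize('  Hadoop  '))      # -> 'hadoop'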
41. Pig
DataFu
Platform for Big Data Analytics and Visual Analytics
● DataFu is a collection of user-defined functions for working with large-scale data in Hadoop and Pig.
● This library was born out of the need for a stable, well-tested library of UDFs for data mining and statistics.
● Used at LinkedIn in many off-line workflows for data-derived products like "People You May Know" and "Skills". It contains functions for:
○ PageRank
○ Statistics (e.g. quantiles, median, variance, etc.)
○ Sampling (e.g. weighted, reservoir, etc.)
○ Convenience bag functions (e.g. enumerating items)
○ Convenience utility function (e.g., assertions, etc.)
○ Set operations (intersect, union)
42. Pig
ABC Radio Stations and Toilets example
Platform for Big Data Analytics and Visual Analytics
● We have a list of local ABC Radio stations in Australia
● We have a list of all public toilets across Australia
● We want to find the closest toilet to each radio station
Demonstration of:
● Data Schemas
● Use of external libraries
● Google Maps API
https://github.com/tomaszbednarz/pig-abc-toilets
43. Apache Spark
Fast, general engine for large-scale data processing and analysis
• Open source, developed at UC Berkeley
• Written in Scala (functional programming language that runs in a JVM)
• Key Concepts
• Avoid the data bottleneck by distributing data when it is stored
• Bring the processing to the data
• Data is stored in memory
• Improves efficiency through (up to 100x faster):
In-memory computing primitives
General computation graphs
• Improves usability through:
Rich APIs in Java, Scala, Python
Interactive shell in Python, Scala
Up to 2-10x less code
Platform for Big Data Analytics and Visual Analytics
[Spark stack diagram: API on top of the Spark engine; cluster computing via Spark Standalone, YARN or Mesos; storage on HDFS.]
44. Apache Spark
RDD (Resilient Distributed Dataset)
• RDD (Resilient Distributed Dataset)
• Resilient – if data in memory is lost, it can be recreated
• Distributed – stored in memory across the cluster
• Dataset – initial data can come from a file or be created programmatically
• RDDs are the fundamental unit of data in Spark
• Concept: Resilient Distributed Datasets (RDDs)
Immutable collections of objects spread across a cluster
Built through parallel transformations (map, filter, etc)
Automatically rebuilt on failure
Controllable persistence (e.g. caching in RAM)
Platform for Big Data Analytics and Visual Analytics
From “Parallel Programming with Spark”
by Matei Zaharia, UC Berkeley
45. Operations
Two types: transformations and actions
Transformations (e.g. map, filter, groupBy, join, flatMap)
Lazy operations to build RDDs from other RDDs
Actions (e.g. count, collect, reduce)
Return a result or write it to storage
From “Parallel Programming with Spark”
by Matei Zaharia, UC Berkeley
Platform for Big Data Analytics and Visual Analytics
RDDs can hold any type of element:
- Primitive types:
- Integers, characters, strings, etc.
- Sequence types:
- Lists, arrays, dicts, etc.
- Scala/Java Objects
- Mixed types
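A minimal PySpark sketch of the lazy transformation / action distinction above (assumes a local Spark installation; names are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

nums = sc.parallelize(range(1, 11))          # base RDD built from a Python collection
evens = nums.filter(lambda x: x % 2 == 0)    # transformation: lazy, nothing runs yet
squares = evens.map(lambda x: x * x)         # another lazy transformation

print(squares.count())     # action: triggers the computation -> 5
print(squares.collect())   # action: [4, 16, 36, 64, 100]

sc.stop()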
46. Apache Spark
API
Platform for Big Data Analytics and Visual Analytics
http://www.slideshare.net/frodriguezolivera/apache-spark-41601032
47. Example: Mining Console Logs
Load error messages from a log into memory, then interactively search for patterns:
lines = spark.textFile("hdfs://...")                       # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))     # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()              # Action
messages.filter(lambda s: "bar" in s).count()
. . .
[Diagram: the driver sends tasks to workers; each worker reads a block (Block 1–3), caches it in memory (Cache 1–3) and returns results.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
From “Parallel Programming with Spark” by Matei Zaharia, UC Berkeley
48. The QI group has developed an algorithm to extract significant frames.
A single 30-day trip produces 720 hours or 180 GB of video footage – single-CPU processing takes about 9 hours.
We developed Sparkle, a prototype integration of Spark and OpenCV, and a video reduction tool on top of Sparkle.
Results
Processing (reduction) of 256 x 0.5GB = 128GB of video files on bragg with SPARK-HPC
Resources requested: 128 nodes with 4 processes per node = 512 CPU cores
Execution time: 137s
Automated Big Video Analysis
Integrated video camera systems have been installed on fishing boats to trial 24/7 fishery monitoring of tuna longline operations in Australia.
Platform for Big Data Analytics and Visual Analytics
49. WebVR: Virtual Reality in Web Browsers
collaboration with NIST (Sandy Ressler)
Platform for Big Data Analytics and Visual Analytics
50. SPARK-HPC
SPARK-HPC is an open-source adapter for running Spark on PBS clusters
Well suited for compute- and memory-intensive applications (e.g., large-scale machine learning)
Enables Spark computation on CSIRO HPC clusters including bragg (128 dual Xeon 8-core E5-2650 nodes with 384 Kepler Tesla K20 GPUs)
Open source, see: https://github.com/csirobigdata/spark-hpc
Status on CSIRO HPC Clusters:
Needs to be migrated to SLURM and redeployed
Platform for Big Data Analytics and Visual Analytics
52. For even more discussions
Directions
• Connect Big Data and Science
• Infrastructure
• Data Provenance
• How to link data centers together
• Visual Analytics
• Real time data processing
• Internet of Things
• Art + Science: communication
• Spark + GPUs
http://devblogs.nvidia.com/parallelforall/bidmach-machine-learning-limit-gpus/
Platform for Big Data Analytics and Visual Analytics
53. Thank you
CONTACT Tomasz Bednarz
E: tomasz.bednarz@csiro.au
T: (07) 3833 5544
CSIRO DIGITAL PRODUCTIVITY FLAGSHIP
Editor's Notes
You write a single program similar to DryadLINQ
Distributed data sets with parallel operations on them are pretty standard; the new thing is that they can be reused across ops
Variables in the driver program can be used in parallel ops; accumulators useful for sending information back, cached vars are an optimization
Mention cached vars useful for some workloads that won’t be shown here
Mention it’s all designed to be easy to distribute in a fault-tolerant fashion
Key idea: add “variables” to the “functions” in functional programming