This document introduces Trumania, an open source Python library for generating realistic synthetic data. Trumania uses scenarios, populations, stories, and generators to simulate data. Populations contain dimensional data and relationships while stories execute random operations to produce event logs. Trumania aims to address the need for test data by generating diverse, correlated datasets without using real data. The document provides examples of how to define populations, generators, and stories in Trumania and notes that the project is now open source.
Trumania, a realistic scenario-based data-generator
1. Trumania, a realistic scenario-based data-generator
Svend Vanderveken
Leuven Data Science meetup - January 2018
2. Real Impact Analytics
• Data analytics solutions for telecommunication operators
• https://realimpactanalytics.com
• We’re hiring :)
Gautier Krings
• Co-founder of Jetpack.AI
• http://jetpack.ai
Svend Vanderveken
• Freelance Data Engineer
• @svend_x4f
• https://sv3nd.github.io
About us
With some awesome contributions from:
● Thoralf Gutierrez
● Milan van der Meer
● Floran Hachez
3. The problem
Data engineers and data scientists
need realistic test datasets
to validate the behaviour of data-processing applications
4. The problem
Why such datasets are hard to come by:
● using existing data is often not allowed
● we need a great diversity of datasets to validate many situations
7. Existing solutions
Schema-based approach
● sufficient for many use cases
=> if you can, use that: it’s the simplest and the fastest
● caveat:
○ columns are often uncorrelated & dataset has no internal structure
○ little/no use of empirical distributions
○ hard to manipulate in terms of cause and consequences
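The "uncorrelated columns" caveat can be seen in a minimal sketch of a schema-based generator (purely illustrative, not taken from any particular library): every column is sampled independently, so the rows carry no internal structure and no cause-and-effect relationships.

```python
import random

random.seed(42)

CITIES = ["Brussels", "Cape Town", "Sao Paulo"]
PLANS = ["prepaid", "postpaid"]

def schema_based_rows(n):
    """Sample every column independently from its own distribution.
    Simple and fast, but city, plan, and usage end up uncorrelated."""
    return [
        {
            "city": random.choice(CITIES),
            "plan": random.choice(PLANS),
            "monthly_calls": random.randint(0, 500),
        }
        for _ in range(n)
    ]

rows = schema_based_rows(5)
for row in rows:
    print(row)
```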
8. Existing solutions
Learning-based approaches
● fit a multivariate model to production data
● sample data from it
SDGen:
github.com/iostackproject/SDGen
Synthetic Data Vault:
dspace.mit.edu/handle/1721.1/109616
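The fit-and-sample idea behind these tools can be sketched in a few lines (an illustrative simplification, not the actual algorithms of SDGen or the Synthetic Data Vault): estimate a multivariate model from production-like data, then draw synthetic rows from it, so no real row is reused but the correlations survive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for sensitive production data: 1000 rows of
# (call duration, data volume), deliberately correlated.
duration = rng.normal(120, 30, size=1000)
volume = 0.8 * duration + rng.normal(0, 10, size=1000)
production = np.column_stack([duration, volume])

# "Fit" a multivariate model: here simply the empirical
# mean vector and covariance matrix.
mean = production.mean(axis=0)
cov = np.cov(production, rowvar=False)

# Sample synthetic rows from the fitted model: the correlation
# between the two columns is preserved.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(np.corrcoef(production, rowvar=False)[0, 1])
print(np.corrcoef(synthetic, rowvar=False)[0, 1])
```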
9. Existing solutions
Scenario/simulation based:
• Koen de Jonge's Telcotraffic simulator
• cf. MLGeek meetup of 26 Oct 2016
• github.com/botkop/botkop-telcotraffic-simulator
Benchmark-based: TPC-DS
12. Trumania population
• Typically static / dimensional data (can be dynamic too)
• Similar approach to schema-based
• Correlated fields if necessary
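A minimal sketch of what "correlated fields" means for a population (the field names and values are made up for illustration; this is not Trumania's API): one attribute is drawn first, and a second attribute is drawn conditionally on it, so the two columns are consistent instead of independent.

```python
import random

random.seed(0)

SITES_BY_REGION = {
    "north": ["N-001", "N-002", "N-003"],
    "south": ["S-001", "S-002"],
}

def build_population(size):
    """Dimensional data for a population of subscribers.
    REGION is drawn first and CELL_SITE is drawn *within* that
    region, so the two fields are correlated rather than independent."""
    population = []
    for i in range(size):
        region = random.choice(list(SITES_BY_REGION))
        population.append({
            "PERSON_ID": f"PERSON_{i:04d}",
            "REGION": region,
            "CELL_SITE": random.choice(SITES_BY_REGION[region]),
        })
    return population

people = build_population(10)
print(people[0])
```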
14. Trumania generators
• Common interface for all random aspects of a Circus
• Essentially a thin wrapper around
• numpy
• faker
• empirical distribution
• ...bring your own distro
• Can be transformed and chained
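The "common interface, transformable and chainable" idea can be sketched as follows (class and method names are hypothetical, chosen only to illustrate the pattern, not Trumania's actual classes): every random source exposes the same draw method, and a map step returns a new generator that post-processes the first.

```python
import numpy as np

class Generator:
    """Minimal common interface over any random source:
    draw(n) returns n values; map(fn) chains a transformation."""

    def __init__(self, draw_fn):
        self._draw_fn = draw_fn

    def draw(self, n):
        return self._draw_fn(n)

    def map(self, fn):
        # Chaining: a new Generator that transforms this one's output.
        return Generator(lambda n: [fn(v) for v in self._draw_fn(n)])

rng = np.random.default_rng(1234)

# A thin wrapper around numpy's exponential distribution...
raw_duration = Generator(lambda n: rng.exponential(scale=120, size=n))

# ...transformed and chained: round to whole seconds, floor at 1.
duration_gen = raw_duration.map(lambda d: max(1, int(round(d))))

durations = duration_gen.draw(5)
print(durations)
```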
16. Trumania population: real data too
Handy to combine real and random data inside a circus
distributors = population.load_from("/data/real_distributors.csv")
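The underlying idea, loading real dimensional rows and enriching them with generated attributes, can be sketched with the standard library alone (the CSV content and the WEEKLY_STOCK field below are invented for illustration):

```python
import csv
import io
import random

random.seed(7)

# Stand-in for a real file such as real_distributors.csv.
REAL_CSV = """distributor_id,city
D001,Brussels
D002,Cape Town
D003,Sao Paulo
"""

# Load the real dimensional data...
distributors = list(csv.DictReader(io.StringIO(REAL_CSV)))

# ...then enrich each real row with a randomly generated attribute.
for d in distributors:
    d["WEEKLY_STOCK"] = random.randint(50, 500)

print(distributors[0])
```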
18. Trumania stories
• Executing a story produces the events
• Sequence of random or deterministic operations
• Made of:
• generators
• random traversal of weighted relationships
• population’s attribute lookups
• update of the Circus state
19.
duration_gen = ...
# outputs a time series with:
# PERSON_ID, CALLER_NAME, DURATION, CALLEE_ID, CALLEE_NAME, TIME
call_story.set_operations(
person_population.ops.lookup(
actor_id_field="PERSON_ID",
select={"NAME": "CALLER_NAME"}),
duration_gen.ops.generate(named_as="DURATION"),
person_population.get_relationship("friends").ops.select_one(
from_field="PERSON_ID", named_as="CALLEE_ID"),
person_population.ops.lookup(
actor_id_field="CALLEE_ID",
select={"NAME": "CALLEE_NAME"}),
clock.ops.timestamp(named_as="TIME")
)
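Outside Trumania, what each of those operations contributes to a row of the event log can be sketched in plain Python (the field names mirror the snippet above, but the tiny population and weights are invented and the implementation is purely illustrative):

```python
import datetime
import random

random.seed(99)

# A tiny "population" with a weighted "friends" relationship.
people = {
    "PERSON_0": {"NAME": "Alice",
                 "friends": [("PERSON_1", 0.7), ("PERSON_2", 0.3)]},
    "PERSON_1": {"NAME": "Bob",
                 "friends": [("PERSON_0", 1.0)]},
    "PERSON_2": {"NAME": "Carol",
                 "friends": [("PERSON_0", 0.5), ("PERSON_1", 0.5)]},
}

def call_event(person_id, now):
    """One execution of the story for one population member:
    attribute lookups, a generated value, a weighted relationship
    traversal, and a timestamp, yielding one event-log row."""
    caller = people[person_id]
    friends, weights = zip(*caller["friends"])
    callee_id = random.choices(friends, weights=weights, k=1)[0]
    return {
        "PERSON_ID": person_id,                          # story member
        "CALLER_NAME": caller["NAME"],                   # attribute lookup
        "DURATION": round(random.expovariate(1 / 120)),  # generator
        "CALLEE_ID": callee_id,                          # weighted select_one
        "CALLEE_NAME": people[callee_id]["NAME"],        # attribute lookup
        "TIME": now.isoformat(),                         # clock timestamp
    }

now = datetime.datetime(2018, 1, 1, 9, 0)
log = [call_event(pid, now) for pid in people]
for row in log:
    print(row)
```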
20. More Trumania
• … and time profiles
• … and a circus persistence mechanism
• … and circus state updates
• ...
21. Trumania caveats
Some possible improvements:
• performance: python, pandas
• more I/O options (it's all local CSV for now)
• it’s a young tool ;)
22. Trumania open source
The project is open source as of today!
Code and scenario examples: github.com/RealImpactAnalytics/trumania
Documentation: realimpactanalytics.github.io/trumania
Slack: trumania.slack.com
Clone it, try it, let us know what you think!
23. Brussels Office
5, Place du Champ de Mars
1050 Brussels
Belgium
Cape Town Office
34 Somerset Road
8005, Green Point, Cape Town
South Africa
São Paulo Office
93, Rua Doutor Andrade Pertence
Vila Olímpia, São Paulo
Brazil
Luxembourg Office
2, Place de Paris
L-2314 Luxembourg
Grand-Duchy of Luxembourg
Follow us:
www.realimpactanalytics.com
24. Legal notices and disclaimer
All rights reserved. No part of this document may be reproduced, utilized, stored in a
retrieval system, or transmitted in any form or by any means without the prior written
permission of Real Impact Analytics.
The information, including any analyses, numbers, images, and pricing data
contained in this document are non-binding and for discussion purposes only. As
such, they are subject to adjustments and/or modifications at the sole discretion of
Real Impact Analytics.
Any agreement is subject to the signature of a definitive final contract between Real
Impact Analytics and the recipient and the acceptance by the Recipient of Real
Impact Analytics’ terms and conditions.