The original, possibly unreadable, rainbow-themed presentation on how to aggregate data using a processing framework of your choice (MapReduce is the one I used) and Accumulo's iterators.
2. Have you heard of...
… TSAR, Summingbird? (Twitter)
… Mesa? (Google)
… Commutative Replicated Data Types?
Each of these describes a system that pre-computes aggregations over large datasets using associative and/or commutative functions.
3. What do we need to pull this off?
We need data structures that can be combined together. Numbers are a trivial example: we can combine two numbers using a function such as plus or multiply. More advanced data structures, such as matrices, HyperLogLogPlus, StreamSummary (used for top-k), and Bloom filters, also have this property!
val partial: T = op(a, b)
4. What do we need to pull this off?
We also need operations that can be performed in parallel. Twitter's tooling emphasizes associative operations, but for our case, operations that are both associative and commutative have the stronger property that we get correct results no matter what order the data arrives in. Common associative operations (summation, set building) are commutative as well.
op(op(a, b), c) == op(a, op(b, c))
op(a, b) == op(b, a)
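A minimal sketch of those two laws in Scala (the trait and instance names below are my own, not from the deck): a generic combining op, spot-checked against numbers and sets.

object CombineLaws {
  trait Op[T] { def op(a: T, b: T): T }

  // Two combinable structures: numbers under addition, sets under union.
  val longSum: Op[Long] = (a: Long, b: Long) => a + b
  def setUnion[A]: Op[Set[A]] = (a: Set[A], b: Set[A]) => a ++ b

  // Spot-check associativity and commutativity for some sample values.
  def lawsHold[T](o: Op[T])(a: T, b: T, c: T): Boolean =
    o.op(o.op(a, b), c) == o.op(a, o.op(b, c)) && o.op(a, b) == o.op(b, a)

  def main(args: Array[String]): Unit = {
    println(lawsHold(longSum)(1L, 2L, 3L))                            // true
    println(lawsHold(setUnion[String])(Set("a"), Set("b"), Set("c"))) // true
  }
}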
5. Wait a minute, isn't that...
You caught me! It's a commutative monoid! From Wolfram MathWorld:
Monoid: a set S that is closed under an associative binary operation and has an identity element I in S such that for all a in S, Ia = aI = a.
Commutative monoid: a monoid M such that for every two elements a and b in M, ab = ba.
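In code, that definition translates into something like the following sketch (names are illustrative); the identity element is what lets us safely fold zero or more partial results.

trait CommutativeMonoid[T] {
  def identity: T
  def op(a: T, b: T): T // must be associative and commutative
}

object LongSumMonoid extends CommutativeMonoid[Long] {
  val identity = 0L
  def op(a: Long, b: Long): Long = a + b
}

// Any grouping or ordering of the fold gives the same answer:
// List(1L, 2L, 3L).foldLeft(LongSumMonoid.identity)(LongSumMonoid.op) == 6L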
6. Put it to work
The example we're about to see uses MapReduce and Accumulo. The same can be accomplished using any processing framework that supports map and reduce operations, such as Spark or Storm's Trident interface.
7. We need two functions...
Map
– Takes an input datum and turns it into some combinable structure
– Like parsing strings to numbers, or creating single-element sets for combining
Reduce
– Combines the mergeable data structures using our associative and commutative function (see the sketch below)
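Here is a plain-Scala sketch of that pair for the counting case, independent of Hadoop's actual Mapper/Reducer API (function names are illustrative):

object TwoFunctions {
  // Map: turn an input datum into a combinable structure.
  def map(record: String): Long = 1L                       // each record counts once
  def mapToSet(record: String): Set[String] = Set(record)  // single-element set variant

  // Reduce: combine the mergeable structures with our associative,
  // commutative function.
  def reduce(partials: Iterable[Long]): Long = partials.foldLeft(0L)(_ + _)
  def reduceSets[A](partials: Iterable[Set[A]]): Set[A] =
    partials.foldLeft(Set.empty[A])(_ ++ _)
}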
8. Yup, that's all!
● Map will be called on the input data once, in a Mapper instance.
● Reduce will be called in a Combiner, a Reducer, and an Accumulo iterator!
● The Accumulo iterator is configured to run on major compactions, minor compactions, and scans.
● That's five places where the same piece of code runs. Talk about modularity!
9. What does our Accumulo Iterator look like?
● We can reuse Accumulo's Combiner type here:
override def reduce(key: Key, values: java.util.Iterator[Value]): Value = {
  // Deserialize and combine all intermediate values. This logic
  // should be identical to what is in the MapReduce Combiner
  // and Reducer.
}
● Our function has to be commutative because major compactions often pick a subset of smaller files to combine, which means a single iterator invocation may see only a discrete subset of the data.
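Fleshed out, the combiner might look like the sketch below. Accumulo already ships a SummingCombiner, so treat this hand-rolled version as illustrative only; it encodes longs as UTF-8 strings purely for readability.

import java.nio.charset.StandardCharsets.UTF_8
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.iterators.Combiner

class StringLongSummingCombiner extends Combiner {
  override def reduce(key: Key, values: java.util.Iterator[Value]): Value = {
    // Fold every intermediate value for this key into one partial result,
    // using the same logic as the MapReduce Combiner and Reducer.
    var sum = 0L
    while (values.hasNext) {
      sum += new String(values.next().get(), UTF_8).toLong
    }
    new Value(sum.toString.getBytes(UTF_8))
  }
}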
10. Counting in practice (pt 1)
We've seen how to aggregate values together. What's the best way to structure our data and query it?
Twitter's TSAR is a good starting point. It allows users to declare what they want to aggregate:
Aggregate(
  onKeys(("origin", "destination"))
  producing(Count))
This describes generating an edge between two cities and calculating a weight for it.
11. Counting in practice (pt 2)
With that declaration, we can infer that the user wants their operation to be summing over each instance of a given pairing, so the base value is 1 (sounds a bit like word count, huh?). We need a key for each base value and partial computation to be reduced with. For this simple pairing we can have a schema like (with \0 as the field separator):
<field_1>\0<value_1>\0...<field_n>\0<value_n> count: "" [] <serialized long>
I recently traveled from Baltimore to Denver. Here's what that trip would look like:
origin\0bwi\0destination\0dia count: "" [] \x01
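As a sketch, writing one base value under this schema with the standard Accumulo client API might look like this (the helper and table name are my own placeholders):

import org.apache.accumulo.core.client.{BatchWriterConfig, Connector}
import org.apache.accumulo.core.data.{Mutation, Value}
import org.apache.hadoop.io.Text

object CountWriter {
  def writeTrip(conn: Connector, table: String, fields: Seq[(String, String)]): Unit = {
    // Row = field/value pairs joined with \0, e.g. "origin\0bwi\0destination\0dia"
    val row = fields.flatMap { case (f, v) => Seq(f, v) }.mkString("\u0000")
    val m = new Mutation(new Text(row))
    m.put(new Text("count"), new Text(""), new Value(Array[Byte](1))) // base value 1
    val writer = conn.createBatchWriter(table, new BatchWriterConfig)
    try writer.addMutation(m) finally writer.close()
  }
}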
12. Counting in practice (pt 3)
● The iterator combines all values that are mapped to the same key.
● We encoded the aggregation function into the column family of the key.
– We can arbitrarily add new aggregate functions by updating a mapping of column family to function and then updating the iterator deployment (see the sketch below).
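The mapping could be as simple as the sketch below (the registry shape is an assumption, not the deck's actual code):

object AggregateRegistry {
  // Column family -> combining function over deserialized longs.
  val byColumnFamily: Map[String, (Long, Long) => Long] = Map(
    "count" -> ((a: Long, b: Long) => a + b),
    "max"   -> ((a: Long, b: Long) => math.max(a, b)),
    "min"   -> ((a: Long, b: Long) => math.min(a, b))
  )
  // The iterator looks up the key's column family here and folds the
  // values with the matching function; adding an aggregate is a map
  // update plus an iterator redeploy.
}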
13. Something more than counting
● Everybody counts, but what about something like top-k?
● The key schema isn't flexible enough to show a relationship between two fields.
● We want to know the top-k relationship between origin and destination cities.
● That column qualifier was looking awfully blank. It'd be a shame if someone were to put data in it...
14. How you like me now?
● Aggregate(
    onKeys(("origin"))
    producing(TopK("destination")))
● <field_1>\0<value_1>\0...<field_n>\0<value_n> <op>: <relation> [] <serialized data structure>
● Let's use my Baltimore->Denver trip as an example:
origin\0BWI topk: destination [] {"DIA": 1}
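For the serialized data structure, stream-lib's StreamSummary (the top-k structure mentioned on slide 3) is one fit. A sketch, with the capacity chosen arbitrarily:

import com.clearspring.analytics.stream.StreamSummary

object TopKValue {
  // Build a bounded top-k summary from observed destinations,
  // one offer per trip.
  def fromDestinations(destinations: Seq[String]): StreamSummary[String] = {
    val summary = new StreamSummary[String](1000)
    destinations.foreach(d => summary.offer(d))
    summary
  }
  // In the iterator, two summaries can be merged by re-offering each
  // item from one into the other along with its accumulated count.
}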
15. But how do I query it?
● This schema is really geared towards point queries (a query sketch follows this list).
● Users would know exactly which dimensions they were querying across to get an answer.
– BUENO: "What are the top-k destinations for Bill when he leaves BWI?"
– NO BUENO: "What are all the dimensions and aggregations I have for Bill?"
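The "bueno" query is a single-row scan. A sketch against the standard Accumulo client API (the table name and empty authorizations are placeholders):

import org.apache.accumulo.core.client.Connector
import org.apache.accumulo.core.data.Range
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.io.Text

object PointQuery {
  def topKDestinations(conn: Connector, table: String): Unit = {
    val scanner = conn.createScanner(table, Authorizations.EMPTY)
    // We know the exact dimensions, so we can name the exact row.
    scanner.setRange(Range.exact("origin\u0000BWI"))
    scanner.fetchColumn(new Text("topk"), new Text("destination"))
    val it = scanner.iterator()
    while (it.hasNext) {
      val e = it.next()
      println(s"${e.getKey} -> ${e.getValue}") // deserialize the top-k structure here
    }
  }
}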
16. Ruminate on this
● Prepare functions
– Preparing the input to do things like time bucketing and normalization (Jared Winick's Trendulo); see the sketch after this list.
● Age-off
– Combining down to a single value means that value represents all historical data. Maybe we don't care about that and would like to age off data after a day/week/month/year. Mesa's batch IDs could be of use here.
● Security labels
– Notice how I deftly avoided this topic. We should be able to bucket aggregations based on visibility, but we need a way to express the best way to handle this. Maybe just preserve the input data's security labeling and attach it to the output of our map function?
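A tiny sketch of the prepare-function idea for time bucketing (the bucket sizes are illustrative):

object Prepare {
  // Truncate a timestamp to the start of its bucket so aggregates
  // group by hour, day, etc.
  def bucket(tsMillis: Long, bucketMillis: Long): Long =
    tsMillis - (tsMillis % bucketMillis)

  val hourly: Long => Long = bucket(_, 60L * 60 * 1000)
  val daily:  Long => Long = bucket(_, 24L * 60 * 60 * 1000)
}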
17. FIN
(hope this wasn't too hard to read)
Comments, suggestions or inflammatory messages should be
sent to @BillSlacum or wslacum@gmail.com