#practical usage of regex
sourabhchandrakarbooks3 · 1 year ago
Text
Top 10 Python Coding Secrets in 2023: A Journey through Sourabh Chandrakar Books
Python, a versatile and powerful programming language, continues to evolve, offering developers new tools and techniques to enhance their coding experience. In this article, we will explore the top 10 secret Python coding tips in 2023, drawing inspiration from the invaluable insights found in Sourabh Chandrakar's books.
1. Mastering List Comprehensions:
In Sourabh Chandrakar's books, you'll find a treasure trove of knowledge on list comprehensions. These concise and readable expressions allow you to create lists in a single line, boosting both efficiency and code elegance. Dive deep into his works to uncover advanced techniques for harnessing the full potential of list comprehensions.
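For instance, a minimal sketch of the idea (our own illustration, not an excerpt from the books):

squares = [n * n for n in range(10) if n % 2 == 0]
print(squares)  # [0, 4, 16, 36, 64]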
2. Context Managers for Resource Management:
Proper resource management is crucial in Python development. Sourabh Chandrakar emphasizes the use of context managers, and his books delve into the intricacies of the with statement. Learn how to streamline your code and ensure efficient handling of resources like files and database connections.
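A typical with-statement pattern looks like this (an illustrative sketch; the file name is hypothetical):

# The with statement closes the file automatically when the block exits,
# even if an exception is raised inside it.
with open('data.txt') as fh:
    for line in fh:
        print(line.rstrip())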
3. Decorators Demystified:
Sourabh Chandrakar's books provide an in-depth exploration of decorators, a powerful Python feature for modifying or extending functions and methods. Unlock the secrets of creating your own decorators to enhance code modularity and reusability.
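A small sketch of a home-made decorator (illustrative; log_calls is our own made-up name):

import functools

def log_calls(func):
    # functools.wraps preserves the wrapped function's name and docstring.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f'calling {func.__name__}')
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

print(add(2, 3))  # prints "calling add", then 5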
4. Understanding Generators and Iterators:
Generators and iterators play a pivotal role in optimizing memory usage and enhancing performance. Sourabh Chandrakar's insights into these concepts will equip you with the knowledge to write efficient, memory-friendly code.
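A minimal generator sketch (our own illustration):

def countdown(n):
    # Yields one value at a time; nothing is ever stored in a list.
    while n > 0:
        yield n
        n -= 1

print(list(countdown(5)))  # [5, 4, 3, 2, 1]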
5. Exploiting the Power of Regular Expressions:
Regular expressions are a potent tool for string manipulation and pattern matching. Sourabh Chandrakar's books offer practical examples and tips for mastering regex in Python, enabling you to write robust and flexible code for text processing.
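For example, a small sketch with Python's re module (the log line is made up):

import re

log = 'user=alice id=42 status=ok'
# Named groups make the pattern self-documenting.
m = re.search(r'user=(?P<user>\w+) id=(?P<id>\d+)', log)
if m:
    print(m.group('user'), m.group('id'))  # alice 42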
6. Optimizing Code with Cython:
Take your Python code to the next level by exploring the world of Cython. Chandrakar's expertise in this area is evident in his books, guiding you through the process of integrating C-like performance into your Python applications.
7. Advanced Error Handling Techniques:
Sourabh Chandrakar places a strong emphasis on writing robust and error-tolerant code. Delve into his books to discover advanced error-handling techniques, including custom exception classes and context-specific error messages.
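A brief sketch of a custom exception class (illustrative; ConfigError and load_port are made-up names):

class ConfigError(Exception):
    """Raised when a configuration value is missing or invalid."""

def load_port(settings):
    try:
        return int(settings['port'])
    except (KeyError, ValueError) as exc:
        # Re-raise with a context-specific message, keeping the original cause.
        raise ConfigError(f'invalid port setting: {exc}') from exc

print(load_port({'port': '8080'}))  # 8080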
8. Harnessing the Power of Enums:
Enums provide a clean and readable way to represent symbolic names for values. Sourabh Chandrakar's books shed light on leveraging Enums in Python to enhance code clarity and maintainability.
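A short sketch of Enum usage (our own illustration):

from enum import Enum

class Status(Enum):
    PENDING = 'pending'
    ACTIVE = 'active'
    CLOSED = 'closed'

order_status = Status.ACTIVE
print(order_status.name, order_status.value)  # ACTIVE active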
9. Mastering the asyncio Module:
Asynchronous programming is becoming increasingly important in modern Python development. Explore Chandrakar's insights into the asyncio module, uncovering tips for efficient asynchronous code design.
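A minimal asyncio sketch (illustrative; the sleep stands in for real I/O):

import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)  # stands in for a real network call
    return f'{name} done'

async def main():
    # Both tasks run concurrently, so this takes about 1 second, not 2.
    results = await asyncio.gather(fetch('a', 1), fetch('b', 1))
    print(results)  # ['a done', 'b done']

asyncio.run(main())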
10. The Art of Unit Testing:
Comprehensive unit testing is a hallmark of professional Python development. Sourabh Chandrakar's books guide you through the art of writing effective unit tests, ensuring the reliability and maintainability of your codebase.
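A compact sketch with the standard unittest module (slugify is a made-up example function):

import unittest

def slugify(text):
    return text.strip().lower().replace(' ', '-')

class SlugifyTest(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(slugify('  Hello World '), 'hello-world')

if __name__ == '__main__':
    unittest.main()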
Conclusion:
In the dynamic world of Python development, staying ahead requires constant learning and exploration. Sourabh Chandrakar's books serve as a valuable resource, offering deep insights into advanced Python coding techniques. By incorporating these top 10 secret Python coding tips into your skill set, you'll be well-equipped to tackle the challenges of 2023 and beyond. Happy coding!
techandguru-blog · 5 years ago
Link
There are times when we do not know the exact item, but we know what it looks like, i.e., it has a specific pattern and certain characteristics. So just by knowing the pattern, we can identify the items. In the same way, there are patterns to identify strings or sets of strings in a given text or file in Java. For that, we have REGULAR EXPRESSIONS in Java. For example, if we want to extract all emails from a given text, we know what emails look like, so we can define a pattern. We create a regex to represent that pattern. By performing a pattern match on the given text, we can list all the emails in the given input text.
So a regular expression is a special sequence of characters that helps to match, find, or edit strings or sets of strings in the given input, using a specialized string held in a so-called Pattern. Regular expression support in Java is provided through the java.util.regex package, which primarily contains the three classes listed below.
- Pattern Class: It is used to define patterns for matching. An object of the Pattern class represents a compiled representation of a regular expression. There is no public constructor available to create an object of the Pattern class. To instantiate an object of the Pattern class, one has to use one of the public static compile() methods of the Pattern class. These methods accept a regular expression string as the first argument.
- Matcher Class: The Matcher class is an engine that interprets the pattern of a regular expression and performs matches on the input string. The Matcher class, too, does not have any public constructor. To obtain an object of the Matcher class, one has to call the matcher() method on a Pattern object.
- PatternSyntaxException Class: A PatternSyntaxException is an unchecked exception that indicates a syntax error in the regular expression.
CAPTURING GROUP in Regular Expression
A capturing group represents a group of characters put together as a single unit. Capturing groups are created by placing the characters to be grouped in parentheses, e.g. (techie360).
Capturing groups are numbered by counting their opening parentheses from left to right. E.g., ((t)(pq)) contains capturing groups in the order ((t)(pq)), (t), (pq).
To find the number of capturing groups in a regular expression, call the groupCount() method on a Matcher object. There is also a special group, group 0, which represents the entire expression; it is not included in the count returned by groupCount().
Example of Capturing Group usage
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {
    public static void main(String args[]) {
        // input String
        String line = "you are reading post on techie360!";
        String pattern = "(.*)(\\d+)(.*)";

        // Create a Pattern object
        Pattern p = Pattern.compile(pattern);

        // Now create a Matcher object.
        Matcher m = p.matcher(line);
        if (m.find()) {
            System.out.println("Found value: " + m.group(0));
            System.out.println("Found value: " + m.group(1));
            System.out.println("Found value: " + m.group(2));
        } else {
            System.out.println("NO MATCH");
        }
    }
}
The output of the above program would be
Found value: you are reading post on techie360!
Found value: you are reading post on techie36
Found value: 0

Note that the first greedy (.*) consumes everything it can, leaving only the final digit "0" for (\\d+) and "!" for the last group.
REGULAR EXPRESSION SYNTAX AND MEANING
In the table below, a complete list of regular expression constructs is given.

- ^ : Matches the beginning of the line.
- $ : Matches the end of the line.
- . : Matches any single character except a newline. Using the DOTALL (s) option allows it to match a newline as well.
- [...] : Matches any single character in brackets.
- [^...] : Matches any single character not in brackets.
- re* : Matches 0 or more occurrences of the preceding expression.
- re+ : Matches 1 or more occurrences of the preceding expression.
- re? : Matches 0 or 1 occurrence of the preceding expression.
- re{n} : Matches exactly n occurrences of the preceding expression.
- re{n,} : Matches n or more occurrences of the preceding expression.
- re{n,m} : Matches at least n and at most m occurrences of the preceding expression.
- a|b : Matches either a or b.
- (re) : Groups regular expressions and remembers the matched text.
- (?:re) : Groups regular expressions without remembering the matched text.
- (?>re) : Matches the independent pattern without backtracking.
- \w : Matches word characters.
- \W : Matches non-word characters.
- \s : Matches whitespace. Equivalent to [ \t\n\x0B\f\r].
- \S : Matches non-whitespace.
- \d : Matches digits. Equivalent to [0-9].
- \D : Matches non-digits.
- \A : Matches the beginning of the entire string.
- \Z : Matches the end of the entire string except for an allowable final line terminator.
- \z : Matches the end of the entire string.
- \G : Matches the point where the last match finished.
- \n : Back-reference to capture group number "n".
- \b : Matches a word boundary when outside brackets; matches a backspace (0x08) when inside brackets.
- \B : Matches a non-word boundary.
- \n, \t, etc. : Match newlines, carriage returns, tabs, etc.
- \Q : Escapes (quotes) all characters up to \E.
- \E : Ends quoting begun with \Q.
METHODS OF MATCHER CLASS
Matcher class methods can be divided into three categories based on the function they perform:
- Index Methods: index methods provide the index of a match found in the input string. Below is the list of index methods:
- public int start() : Returns the start index of the previous match.
- public int start(int group) : Returns the start index of the subsequence captured by the given group during the previous match operation.
- public int end() : Returns the offset after the last character matched.
- public int end(int group) : Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.
- Study Methods: these methods perform a match on the input string and return whether a match is found or not. Please see the list below for all study methods:
- public boolean lookingAt() : Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
- public boolean find() : Attempts to find the next subsequence of the input sequence that matches the pattern.
- public boolean find(int start) : Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.
- public boolean matches() : Attempts to match the entire region against the pattern.
REPLACEMENT METHODS:
These methods perform replacement in the input string. Below are replacement methods
- public Matcher appendReplacement(StringBuffer sb, String replacement) : Implements a non-terminal append-and-replace step.
- public StringBuffer appendTail(StringBuffer sb) : Implements a terminal append-and-replace step.
- public String replaceAll(String replacement) : Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.
- public String replaceFirst(String replacement) : Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.
- public static String quoteReplacement(String s) : Returns a literal replacement String for the specified String. This method produces a String that will work as a literal replacement for s in the appendReplacement method of the Matcher class.
matches() and lookingAt() methods: Similarities and differences
- Both methods attempt to match a pattern against the input string.
- Both start matching at the beginning of the input string.
- matches() requires the complete string to match, but lookingAt() does not.
To demonstrate the difference, see the example below:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexMatches {
    private static final String REGEX = "too";
    private static final String INPUT = "tooo";
    private static Pattern pattern;
    private static Matcher matcher;

    public static void main(String args[]) {
        pattern = Pattern.compile(REGEX);
        matcher = pattern.matcher(INPUT);

        System.out.println("REGEX is: " + REGEX);
        System.out.println("INPUT is: " + INPUT);
        System.out.println("lookingAt(): " + matcher.lookingAt());
        System.out.println("matches(): " + matcher.matches());
    }
}
The output of the above program is:
REGEX is: too
INPUT is: tooo
lookingAt(): true
matches(): false
- replaceFirst() replaces the first matching occurrence, and replaceAll() replaces all occurrences of the matching pattern.
So we have seen how we can use regular expressions in Java for pattern matching. Regular expressions are quite a powerful tool in Java to find, edit, and replace text in an input string.
Hope you enjoyed the article. Please share and subscribe for the latest article updates.
phoorn · 4 years ago
Text
A real programmer
They say a real programmer is lazy. They hate doing things so much they spend weeks writing tools that will save them minutes. Of course, sometimes they write tools that will save lots of people centuries.
What I’m getting at is I wrote a script today. It tells you the word count of each chapter in your story.
I say “chapter”, but really I mean –
What do you call it, when you, you know:
“Blah, blah, blah”, said Percy.
Little did he know, the fool, that that blah was the last blah he would ever blah.
On the following Tuesday I packed up my jimjams and set out into the Wild.
You don’t call that a chapter. I’ll die before I call it a scene (I’m not a director, and I don’t secretly wish I was one).
Well, anyway, my script counts up those. I wrote it in C. Would you like to C?
#include <err.h>
#include "../wc/wc.h"
#include "../cat/cat.h"
#include "../split/split.h"

int main (int argc, char **argv)
{
    if (argc < 2)
        errx (1, "usage: %s file", *argv);
    char *buf;
    int ret = cat_paths_into_buf (&buf, (const char *[]) {argv[1]}, 1, 0);
    size_t n_bufs;
    char **bufs = split (buf, strchr (buf, '\0'), (char *[]) {"##"}, 1, &n_bufs, 1);
    struct wc wc;
    each (buf, &bufs[1], n_bufs - 1) {
        wc = wc_str (*buf);
        printf ("%zu: %zu\n", buf - bufs + 1 - 1, wc.words);
    }
}
I’ll tell you some interesting things about this program.
1!!!!!!!!!!!!!!!!!!!!!!! (each)
I like to use a macro to loop over arrays. In C, traditionally you do
for (int i = 0; i < /* eg */ 100; i++) { /* eg */ printf ("%d\n", the_thing[i]); }
And I think that’s more universally useful. I think if you were going to have just one kind of for loop, that would be it. But I like the other kind, the kind you get in Bash and no doubt Python and all the rest: “for THING in THINGS”.
The each macro lets me do that. Actually I can do that without a macro, and did for a long time.
for (struct thing *thing = things; thing < things + n_things; thing++) { (*thing).whatever = "joy"; }
But that’s pretty painful to write because it makes your eyes bleed.
2!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
I include the header files “wc.h”, “cat.h” and “split.h”. These are libs I’ve written. split.h is straightforward: split a string. cat and wc, though, might be interesting to talk about.
C’s a bad language to write little scripts like this in, or so, I feel, people believe. But I say screw you.
What makes C bad as a scripting language is the standard library. It doesn't have convenient functions. As an example, there's no function to simply split a string. There's strtok. It's not really the same. There's no join, no launch_emacs, no nothin'. There's actually no dynamic array struct-plus-functions. Maybe because no one would agree on the best way to do it?
The program I pasted: would it really be that much shorter in Python? I say NO! Not that much shorter. Not when you have libs like wc, cat. But you have to write those yourself.
You can use big general-purpose libraries like glib and maybe one day I will. I've never liked the idea of linking to such a big binary just to write programs like the one I pasted. But I think that's empty prejudice. Most scripts you write yourself in Bash are personal. I mean, they run on your machine, and glib (or whatever) is on your machine.
My wc and cat libs are based, you’ll know if you’re in the know, on Unix commands: wc (wordcount) and cat (concatenate – it reads in files, outputs them to the screen).
I’ve thought for a long time I’d love to see GNU’s Coreutils designed as C libraries. The commands would be frontends for those libraries. Because, like I say, the pain in C isn’t the language, it’s the low-level nature of its standard library.
Even other libraries are obnoxiously low-level. There’s a commonly-used library, PCRE (Perl-Compatible Regular Expressions). It’s very clever, well made, has loads of functions.
But it doesn’t have a “match” function! I swear, I’ve looked. There’s no function you can call to just figure out if a bit of text matches a pattern. Instead you’re expected to compile a regex, then execute it, checking for errors along the way. It about six lines for something that in in other languages is simply “==” or “=~”.
So, of course, the first thing you do when faced with that is wrap it in your own function, regmatch, and forget about all the rest.
What else do I have to complain about?
Nothing. What a disgrace.
That was a pretty incoherent blog post, but I’ve overdone it on the coffee / energy drinks and have a headache.
Oh
I suppose I should say something about the story. I’ll tell you I’ve come up with a name: The Long, Thin Tail of God. I’ll also spurt out here the names I considered. Fast content, ker-splat, paste it in.
Home
Lonely
Marrowbone War
Burane Massacre
Big Rock on a Beach
Big House on the Hill
The Whore of Black Lake
To Jump on My Bed
Jumping At Every Sound
An Archaeologist
A Spy
It Wasn’t Sheer Bliss
Ice-cold home
As If I Was A Stranger
Xunotic
Xunophillia
Xunophobia
Because You’re not worthy, Soraen
Right in the Centre of Town
Yes, I Would Have Said
Keep the Fire Burning
Eternal Flame
What a Lot of Work it Must Take
The Burané Massacre.
The Ape-People
The Monkey God
The Strange Tale of the Long, Thin Tail
The Long, Thin tail – that’s a nice pun.
The Long, Thin Tail of God
I went with the Tail one for a very practical reason: the story is, in the end, about this monkey god, Xu. And there's bugger all about him in the story until near the end. I kind of liked that. I like stories that surprise me. I don't have strong opinions about how stories should be.
But I know if I don’t drop heavy hints about the God before he’s mentioned at the end, people will say, “Huh? Where did he come from?”
He came from your mama! Now shut up and just enjoy the damn story!
pakuniinfo · 5 years ago
Text
Natural Language Processing Books, Course Data and Tutorials
Natural Language Processing Definition

The history of natural language processing generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three to five years, machine translation would be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill the expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Course Contents for Natural Language Processing

This outline will be similar to your university's course outline.
- Introduction and Overview: Ambiguity and uncertainty in language; regular expressions.
- Regular Languages: The Chomsky hierarchy, regular languages, and their limitations. Finite-state automata. Practical regular expressions for finding and counting language phenomena (see the short sketch after the reference list). A little morphology. In-class demonstrations of exploring a large corpus with regex tools.
- String Edit Distance and Alignment: Key algorithmic tool: dynamic programming; first a simple example, then its use in optimal alignment of sequences. String edit operations, edit distance, and examples of their use in spelling correction and machine translation.
- Context-Free Grammars: Constituency; CFG definition, use, and limitations. Chomsky Normal Form. Top-down parsing, bottom-up parsing, and the problems with each. The desirability of combining evidence from both directions.
- Information Theory: What is information? Measuring it in bits. The "noisy channel model." The "Shannon game," motivated by language! Entropy, cross-entropy, information gain, and their application to some language phenomena.
- Language Modeling and Naive Bayes: Probabilistic language modeling and its applications. Markov models. N-grams. Estimating the probability of a word, and smoothing. Generative models of language, and their application to building an automatically trained email spam filter and automatically determining the language.
- Part-of-Speech Tagging and Hidden Markov Models: The concept of parts of speech, with examples and usage. The Penn Treebank and Brown Corpus. Probabilistic (weighted) finite-state automata. Hidden Markov models (HMMs), their definition and use.
- Probabilistic Context-Free Grammars: Weighted context-free grammars.
- Maximum Entropy Classifiers: The maximum entropy principle and its relation to maximum likelihood. The need in NLP to integrate many pieces of weak evidence. Maximum entropy classifiers and their application to document classification and sentence segmentation.

Reference Materials Recommended by HEC
1. Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Second Edition. Prentice Hall.
2. Manning and Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
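As a small taste of the regex portion of the outline above, here is a sketch in Python (our own illustration; the sample sentence is made up):

import re
from collections import Counter

# Crude regex tokenizer: lowercase the text, then pull out runs of letters.
text = "The Georgetown experiment involved sixty Russian sentences. The test was a test of intelligence."
tokens = re.findall(r"[a-z']+", text.lower())

# Count "language phenomena": here, simple word frequencies.
print(Counter(tokens).most_common(3))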
Get Youtube Videos
Option 1: Download or Watch Online: https://www.youtube.com/watch?v=8S3qHHUKqYkqYk

These course contents follow the HEC outline for this specific subject. If you have any further inquiries, please contact us for details via mail. All the data is extracted from the official HEC website. The basic purpose of this is to gather all course subject data on one page.
globalmediacampaign · 4 years ago
Text
Creating an External Replica of AWS Aurora MySQL with Mydumper
Oftentimes, we need to replicate between Amazon Aurora and an external MySQL server. The idea is to start by taking a point-in-time copy of the dataset. Next, we can configure MySQL replication to roll it forward and keep the data up-to-date. This process is documented by Amazon, however, it relies on the mysqldump method to create the initial copy of the data. If the dataset is in the high GB/TB range, this single-threaded method could take a very long time. Similarly, there are ways to improve the import phase (which can easily take 2x the time of the export). Let's explore some tricks to significantly improve the speed of this process.

Preparation Steps

The first step is to enable binary logs in Aurora. Go to the Cluster-level parameter group and make sure binlog_format is set to ROW. There is no log_bin option in Aurora (in case you are wondering), simply setting binlog_format is enough. The change requires a restart of the writer instance, so it, unfortunately, means a few minutes of downtime.

We can check if a server is generating binary logs as follows:

mysql> SHOW MASTER LOGS;
+----------------------------+-----------+
| Log_name                   | File_size |
+----------------------------+-----------+
| mysql-bin-changelog.034148 | 134219307 |
| mysql-bin-changelog.034149 | 134218251 |
...

Otherwise, you will get an error:

ERROR 1381 (HY000): You are not using binary logging

We also need to ensure a proper binary log retention period. For example, if we expect the initial data export/import to take one day, we can set the retention period to something like three days to be on the safe side. This will help ensure we can roll forward the restored data.

mysql> call mysql.rds_set_configuration('binlog retention hours', 72);
Query OK, 0 rows affected (0.27 sec)

mysql> CALL mysql.rds_show_configuration;
+------------------------+-------+------------------------------------------------------------------------------------------------------+
| name                   | value | description                                                                                          |
+------------------------+-------+------------------------------------------------------------------------------------------------------+
| binlog retention hours | 72    | binlog retention hours specifies the duration in hours before binary logs are automatically deleted. |
+------------------------+-------+------------------------------------------------------------------------------------------------------+
1 row in set (0.25 sec)

The next step is creating a temporary cluster to take the export. We need to do this for a number of reasons: first, to avoid overloading the actual production cluster with our export process; also because mydumper relies on FLUSH TABLES WITH READ LOCK to get a consistent backup, which in Aurora is not possible (due to the lack of SUPER privilege).

Go to the RDS console and restore a snapshot that was created AFTER the date/time where you enabled the binary logs. The restored cluster should also have binlog_format set, so select the correct Cluster parameter group.

Next, capture the binary log position for replication. This is done by inspecting the Recent events section in the console. After highlighting your new temporary writer instance in the console, you should see something like this:

Binlog position from crash recovery is mysql-bin-changelog.034259 32068147

So now we have the information to prepare the CHANGE MASTER command to use at the end of the process.
Exporting the Data

To get the data out of the temporary instance, follow these steps:

- Backup the schema
- Save the user privileges
- Backup the data

This gives us added flexibility; we can do some schema changes, add indexes, or extract only a subset of the data.

Let's create a configuration file with the login details, for example:

tee /backup/aurora.cnf

SHOW GRANTS FOR 'user'@'%';
+---------------------------------------------------------+
| Grants for user@%                                       |
+---------------------------------------------------------+
| GRANT USAGE ON *.* TO 'user'@'%' IDENTIFIED BY PASSWORD |
| GRANT SELECT ON `db`.* TO 'user'@'%'                    |
+---------------------------------------------------------+

We can still gather the hashes and replace them manually in the pt-show-grants output if there is a small-ish number of users:

pt-show-grants --user=percona -ppercona -hpercona-tmp.cgutr97lnli6.us-west-1.rds.amazonaws.com > grants.sql

mysql> select user, password from mysql.user;

Finally, run mydumper to export the data:

mydumper -t 8 --compress --triggers --routines --events --rows=10000000 -v 3 --long-query-guard 999999 --no-locks --outputdir /backup/export --logfile /backup/mydumper.log --regex '^(?!(mysql|test|performance_schema|information_schema|sys))' -O skip.txt --defaults-file /backup/aurora.cnf

The number of threads should match the number of CPUs of the instance running mydumper. In the skip.txt file, you can include any tables that you don't want to copy. The --rows argument will give you the ability to split tables in chunks of X number of rows. Each chunk can run in parallel, so it is a huge speed bump for big tables.

Importing the Data

We need to stand up a MySQL instance to do the data import. In order to speed up the process as much as possible, I suggest doing a number of optimizations to my.cnf as follows:

[mysqld]
pid-file=/var/run/mysqld/mysqld.pid
log-error=/var/log/mysqld.log
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log_slave_updates
innodb_buffer_pool_size=16G
binlog_format=ROW
innodb_log_file_size=1G
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=0
server-id=1000
log-bin=/log/mysql-bin
sync_binlog=0
master_info_repository=TABLE
relay_log_info_repository=TABLE
query_cache_type=0
query_cache_size=0
innodb_flush_neighbors=0
innodb_io_capacity_max=10000
innodb_stats_on_metadata=off
max_allowed_packet=1G
net_read_timeout=60
performance_schema=off
innodb_adaptive_hash_index=off
expire_logs_days=3
sql_mode=NO_ENGINE_SUBSTITUTION
innodb_doublewrite=off

Note that mydumper is smart enough to turn off the binary log for the importer threads. After the import is complete, it is important to revert these settings to "safer" values: innodb_doublewrite, innodb_flush_log_at_trx_commit, sync_binlog, and also enable performance_schema again.

The next step is to create an empty schema by running myloader:

myloader -d /backup/schema -v 3 -h localhost -u root -p percona

At this point, we can easily introduce modifications like adding indexes, since the tables are empty. We can also restore the users at this time:

(echo "SET SQL_LOG_BIN=0;" ; cat grants.sql ) | mysql -uroot -ppercona -f

Now we are ready to restore the actual data using myloader. It is recommended to run this inside a screen session:

myloader -t 4 -d /backup/export -q 100 -v 3 -h localhost -u root -p percona

The rule of thumb here is to use half the number of vCPU threads. I also normally like to reduce mydumper's default transaction size (1000) to avoid long transactions, but your mileage may vary.
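As an aside, the --regex option above is doing real work: it is a negative lookahead that skips the system schemas. A quick way to sanity-check what it keeps (an illustrative Python sketch; mydumper itself uses PCRE, and the sample schema names are made up):

import re

# mydumper matches the pattern against "database.table" names.
pattern = re.compile(r'^(?!(mysql|test|performance_schema|information_schema|sys))')

for name in ('mysql.user', 'sys.config', 'shop.orders', 'app.customers'):
    print(name, '->', 'dump' if pattern.search(name) else 'skip')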
After the import process is done, we can leverage faster methods (like snapshots or Percona XtraBackup) to seed any remaining external replicas.

Setting Up Replication

The final step is setting up replication from the actual production cluster (not the temporary one!) to your external instance. It is a good idea to create a dedicated user for this process in the source instance, as follows:

CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

Now we can start replication, using the binary log coordinates that we captured before:

CHANGE MASTER TO MASTER_HOST='aurora-cluster-gh5s6lnli6.us-west-1.rds.amazonaws.com', MASTER_USER='repl', MASTER_PASSWORD='percona', MASTER_LOG_FILE='mysql-bin-changelog.034259', MASTER_LOG_POS=32068147;
START SLAVE;

Final Words

Unfortunately, there is no quick and easy method to get a large dataset out of an Aurora cluster. We have seen how mydumper and myloader can save a lot of time when creating external replicas, by introducing parallel operations. We also reviewed some good practices and configuration tricks for speeding up the data loading phase as much as possible.

Optimize your database performance with Percona Monitoring and Management, a free, open source database monitoring tool. Designed to work with Amazon RDS MySQL and Amazon Aurora MySQL, with a specific dashboard for monitoring Amazon Aurora MySQL using CloudWatch and direct sampling of MySQL metrics.

Visit the Demo: https://www.percona.com/blog/2020/08/26/creating-an-external-replica-of-aws-aurora-mysql-with-mydumper/
fuzzibare-blog · 5 years ago
Text
Security Everywhere - CloudFlare Global Outage Due to Regex Bug 02/07/19
Security Everywhere Cloudflare Global Outage July 2nd 2019
Cloudflare experienced a global outage of its CDNs for 27 minutes on the 2nd of July, making it the first global outage in eight years. It meant that 19% of all internet traffic was cut, with 550,000 businesses losing network connection for half an hour. Of the Cloudflare service users, the largest sector represented was in Hospital and Healthcare, meaning that hospitals were unable to provide efficient service to patients during that time, potentially endangering lives.
So how and why did it happen?
The root of the problem was a faulty Web Application Firewall (WAF) rule update, which was rolled out prematurely as part of a new scheme to defend against emerging security threats. This was coupled with a poor recovery protocol as well as severe under-preparation by the Site Reliability Engineering team. This recent incident harks back to the first week of lectures, wherein we discussed that security issues often occur as a result of a chain of unfortunate events, or a chain of failures.
Unfortunately this seems like it is always the case, especially with large, dynamic and extremely complicated projects like building a Content Delivery Network, and it is extremely hard to mitigate. In this post I will discuss what went wrong, give my opinion on how they could have approached it better, and discuss their solutions.
The first error was introduced by an engineer who modified a WAF rule. The rule contained a single regex bug, which backtracked endlessly and blocked the processors from progressing. The reason this faulty regex was not caught was that Cloudflare had introduced a separate protocol for WAF rule updates, in response to increasing security threats, that allowed them to be pushed almost instantly to their customers and prevent any new malicious threats from manifesting.
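To see how a single regex can pin a CPU, here is the classic catastrophic-backtracking demonstration (an illustrative Python sketch, not Cloudflare's actual WAF rule):

import re
import time

# Nested quantifiers force the engine to try exponentially many ways to
# split the input before it can conclude there is no match.
pattern = re.compile(r'^(a+)+$')

for n in (18, 20, 22, 24):
    subject = 'a' * n + 'b'  # the trailing 'b' guarantees the match fails
    start = time.perf_counter()
    pattern.match(subject)
    print(f'n={n}: {time.perf_counter() - start:.3f}s')

# Each extra 'a' roughly doubles the running time; a slightly longer input
# would effectively never finish, which is how one rule can hang a processor.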
The second error was that the WAF rule change was not even marked as urgent; it was of standard priority. This meant that it should have passed through all the standard checks and been tested on internal sandboxes as well as incrementally rolled out. But because of the recent WAF rollout protocol changes, and the high volume of WAF rollouts (370 per month), this rule was not dispatched through the correct rollout procedure.
The third error was in the response to the global outage. Firstly, their internal response team was severely underprepared and did not know how to quickly and effectively handle the outage. On the initial discovery of the global outage, it took them time to understand that the outage was not caused by some unprecedented DDoS attack, but by their own system. Then they were unable to follow the protocol quickly because they had not been trained for such a situation, since the last global outage was more than eight years ago.
Secondly, the software they had in place to defend against a global outage was riddled with dependencies and flaws that rendered it useless. When they discovered the cause of the outage, they were not able to disable the offending WAF rule by flipping a global kill switch, because the kill switch required them to access their in-house authentication system. This authentication system was running on their own servers, which were hung. Effectively, the emergency stop button for their servers was only accessible from their unusable servers, much like leaving car keys locked inside one's car.
Furthermore, they needed to access a special bypass system to reach their internal services, including Jira and their build system, but were unable to do so because the system denied access to users who had not logged in frequently.
The result of these consecutive failures was a global outage that took out their entire network and the hundreds of thousands of services and websites running on them.
Now that you understand what went wrong, I will try and present my solutions on how to prevent this from ever occurring again.
Firstly, I do not believe that Cloudflare needs to re-evaluate their rollout process or reduce their WAF rule rollout frequency. They do not have a track record of failed rollouts, as they have not experienced a failure of this magnitude in eight years, despite their enormously high rate of WAF rule changes. Furthermore, they had proved that a previous immediate WAF rollout was able to prevent a highly malicious attack on their servers within a matter of hours, mitigating any serious damage to their clients. I believe that human error inevitably occurs at some point in time, so it is the response protocol that is important.
I believe that they should have had fewer dependencies in their recovery system, especially since it was a last resort. Their response system was entirely dependent on the very servers that they were trying to rescue, meaning that it was in all respects rendered ineffective once the outage had occurred. They should have had their own dedicated servers, tested against emergency circumstances, to make sure that they can push changes throughout the CDN when disaster strikes.
Secondly, they should compartmentalize their server CPU usage so that it does not hang as a result of one process. If they had used a different regex engine which did not continue processing instructions after a reasonable limit, or if they had built in a kill switch which disabled or bypassed processes which were hung, they would not have had this issue.
Thirdly, I believe they should train their emergency response team regularly, for example every six months, to be prepared for such occurrences. They should undergo drills to practice using the recovery systems and following the recovery protocol, and also research new methods to design their recovery protocol to reduce the time taken and damage inflicted. These drills should be conducted on their internal networks with real attacks and real network-destroying bugs purposefully installed to simulate this situation. Of course, further precautions should be taken to make sure these bugs do not end up in the production code.
The actual changes implemented by Cloudflare are as follows:
1. Re-introduce the excessive CPU usage protection that got removed. (Done)
2. Manually inspecting all 3,868 rules in the WAF Managed Rules to find and correct any other instances of possible excessive backtracking. (Inspection complete)
3. Introduce performance profiling for all rules to the test suite. (ETA: July 19)
4. Switching to either the re2 or Rust regex engine which both have run-time guarantees. (ETA: July 31)
5. Changing the SOP to do staged rollouts of rules in the same manner used for other software at Cloudflare while retaining the ability to do emergency global deployment for active attacks.
6. Putting in place an emergency ability to take the Cloudflare Dashboard and API off Cloudflare's edge.
7. Automating update of the Cloudflare Status page.
I believe that the actual changes implemented are quite poor. Firstly, I would like to commend them for resolving the error directly, by introducing CPU usage protections, as well as using a more reliable regex engine that will not backtrack indefinitely.
However, the manual inspection of all 3,868 WAF rules to look for specific instances of backtracking, as well as the performance profiling, is a poor approach in my opinion. The rules are being changed 370 times a month, roughly once every two hours. Even if they had identified and fixed several faulty rules, in one year there is the potential for thousands of new changes to be introduced. Furthermore, another undiscovered bug could lie in any part of their code, and be caused by a multitude of reasons other than regex backtracking alone. They have taken the approach of patching the error where it occurred, but not the root problem, which was their recovery system.
 In their released response, it seems they entirely ignored decoupling their recovery system from their servers. It is like diving into a lake to save a drowning person, who could pull you underwater with them, or trying to use your hands to pull a person away from a live wire: the very thing you are trying to save is the danger, and by tying yourself to it, you are putting yourself in danger. They should have addressed this dependency issue, as an effective recovery system would resolve all cases of system failure immediately, which is more effective than trying to pre-emptively discover all potential bugs.
Overall this case study highlights a few points. Firstly, it highlights that large companies are very prone to major, devastating bugs and failures, so no one should rely on any one service entirely, or you are simply joining a sinking ship with a hole no one has found yet. If you are a business using a CDN, run some of your services through another CDN, or have your own in-house backup servers to handle emergency network failure. Secondly, it shows that even large companies with extremely experienced professionals do not always respond correctly to emergencies, which highlights the need for retraining or hiring more adequately trained staff.
somelongstoryshort-blog1 · 7 years ago
Text
Will You Migrate From Perl To Python?
Both Python and Perl are mature, open source, general-purpose, high-level, interpreted programming languages. However, the usage statistics posted on various websites show that Python is currently more popular than Perl. Hence, a software developer can improve his career prospects by switching from Perl to Python.
A beginner can also learn and use Python without investing extra time and effort. However, you should not switch to another programming language merely because of its popularity and usage. You should keep the major differences between the two programming languages in mind when deciding about moving from Perl to Python.
12 Points You Must Keep in Mind while Switching from Perl to Python
1) Design Goal
Perl was originally designed as a scripting language to simplify report processing. Hence, it comes with built-in text processing capability. On the other hand, Python started out as a hobby programming project. But it was designed with features that help programmers build applications with concise, readable and reusable code. The two programming languages still differ in terms of features and performance.
2) Syntax Rules
The syntax rules of both Python and Perl are influenced by several other programming languages. For instance, Perl borrows features from a number of programming languages including C, shell script, sed, AWK and Lisp. Likewise, Python implements functional programming features in a way similar to Lisp. But Python is hugely popular among modern programming languages due to its simple syntax rules. In addition to being easy to use, Python's syntax rules also enable programmers to express many concepts with less, and more readable, code.
3) Family of Languages
Perl belongs to a family of high-level programming languages that includes Perl 5 and Perl 6. The two versions were designed to coexist, and a developer can migrate from Perl 5 to Perl 6 without investing extra time and effort. Programmers likewise have to choose between two distinct versions of Python: Python 2 and Python 3. However, the two versions of Python are not compatible with each other. Hence, a programmer has to pick one of two distinct versions of the language.
4) Ways to Achieve Same Results
Python enables programmers to express concepts without writing longer lines of code. However, it expects programmers to accomplish tasks or achieve results in a single, specific way. On the other hand, Perl enables programmers to accomplish a single task or achieve the same result in a number of different ways. Consequently, many programmers find Perl to be more flexible than Python. However, the multiple ways to achieve the same result often make code written in Perl messy and the application hard to maintain.
5) Web Scripting Language
Perl was originally written as a UNIX scripting language. Many developers use Perl as a scripting language to avail its built-in text processing capabilities. However, many web developers complain that Perl is slower than other widely used scripting languages. Python is also used extensively by programmers for web application development. However, it lacks built-in web development capabilities. Hence, developers need to avail various frameworks and tools to write web applications in Python efficiently and rapidly.
6) Web Application Frameworks
Most developers nowadays avail the tools and features provided by various frameworks to build web applications efficiently and rapidly. Perl web programmers can choose from an array of frameworks including Catalyst, Dancer, Mojolicious, Poet, Interchange, Jifty, and Gantry. Likewise, web developers also have the option to use a number of Python web frameworks including Django, Flask, Pyramid, Bottle and CherryPy. However, the number of Python web frameworks is much higher than the number of Perl web frameworks.
7) Usage
As mentioned earlier, both Python and Perl are general-purpose programming languages. Hence, each language is used for developing a variety of software applications. Perl is used widely for graphics and network programming, system administration, and development of finance and biometric applications. Python, however, comes with a robust standard library that simplifies web application development, scientific computing, big data solution development, and artificial intelligence tasks. Consequently, developers prefer using Python for development of advanced and mission-critical software applications.
8) Performance and Speed
A number of studies have shown that Python is slower than other programming languages like Java and C++. Hence, developers frequently explore ways to improve the execution speed of Python code. Some developers even replace the default Python runtime with their own custom runtime to make Python applications run faster. Many programmers also find Perl to be faster than Python. Many web developers use Perl as a scripting language to make web applications faster and deliver an improved user experience.
9) Structured Data Analysis
At present, big data is one of the hottest trends in software development. Many enterprises nowadays build custom applications for collecting, storing, and analyzing huge amounts of structured and unstructured data. The PDL (Perl Data Language) enables Perl developers to analyze big data. The built-in text processing capability of Perl also simplifies and speeds up analysis of huge amounts of structured data. However, Python is used widely by programmers for data analysis. Developers also take advantage of robust Python libraries like NumPy to process and analyze huge volumes of data in a faster and more efficient way.
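For example, a vectorized NumPy pass replaces an explicit Python loop over a million values (an illustrative sketch with random data):

import numpy as np

# One vectorized pass over the array instead of a Python-level loop.
values = np.random.default_rng(0).integers(0, 100, size=1_000_000)
print(values.mean(), values.max(), (values > 90).sum())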
10) JVM Interoperability
At present, Java is one of the programming languages used widely for development of desktop, web, and mobile applications. In comparison to Perl, Python interoperates with the Java Virtual Machine (JVM) seamlessly and efficiently (via implementations such as Jython). Hence, developers have the option to write Python code that runs smoothly on the JVM, while taking advantage of robust Java APIs and objects. This interoperability helps programmers build applications targeting the popular Java platform, while writing code in Python instead of Java.
11) Advanced Object Oriented Programming
Both Perl and Python are object-oriented programming languages. However, Python implements advanced object-oriented programming better than Perl. While writing code in Perl, programmers still need to use packages instead of classes. Python programmers can write high-quality and modular code by using classes and objects. Many developers find it difficult to keep the code simple and readable while writing object-oriented code in Perl. On the other hand, Perl makes it easier for programmers to accomplish a variety of tasks simply by using one-liners on the command line.
12) Text Processing Capability
Unlike Python, Perl was designed with built-in text processing capabilities. Hence, many programmers prefer using Perl for report generation. Perl also makes it easier for programmers to perform regex and string manipulation operations like matching, substitution, and translation. It also does not require developers to write additional code to perform exception handling and I/O operations. Consequently, many programmers prefer Perl to Python when building applications that need to process textual data or generate reports.
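For comparison, the matching and substitution that Perl expresses as $text =~ /.../ and s/.../.../g look like this in Python (a small sketch with made-up data):

import re

text = 'Contact: alice@example.com, bob@example.org'

# Matching: roughly Perl's  $text =~ /(\w+)@([\w.]+)/
m = re.search(r'(\w+)@([\w.]+)', text)
if m:
    print(m.group(1), m.group(2))  # alice example.com

# Substitution: roughly Perl's  $text =~ s/@/ [at] /g
print(re.sub('@', ' [at] ', text))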
All in all, a large number of software developers prefer Python to Perl. However, there are a number of programming languages - Java, C, C++ and C# - which are currently more popular than both Perl and Python. Moreover, Python, like other technologies, also has its own shortcomings. For instance, you will be required to use Python frameworks while writing web applications in the language. Hence, you should keep in mind the pros and cons of both programming languages before migrating from Perl to Python.