Recent Entries 10
- pattern minor 112d ago
  Preprocessing steps to follow while cleaning and extracting text data from tweets

  I have a dataset of around 200,000 tweets and am running a classification task on it. The dataset has two columns: the class label and the tweet text. In the preprocessing step I pass the dataset through the following cleaning step:

  ```
  import re
  from datetime import datetime

  import pandas as pd
  from nltk.corpus import stopwords


  def preprocess(raw_text):
      # keep only letters
      letters_only_text = re.sub("[^a-zA-Z]", " ", raw_text)

      # convert to lower case and split
      words = letters_only_text.lower().split()

      # remove stopwords
      stopword_set = set(stopwords.words("english"))
      meaningful_words = [w for w in words if w not in stopword_set]

      # join the cleaned words back into a single string
      cleaned_word_list = " ".join(meaningful_words)
      return cleaned_word_list


  def process_data(dataset):
      tweets_df = pd.read_csv(dataset, delimiter='|', header=None)
      num_tweets = tweets_df.shape[0]
      print("Total tweets: " + str(num_tweets))

      cleaned_tweets = []
      print("Beginning processing of tweets at: " + str(datetime.now()))
      for i in range(num_tweets):
          cleaned_tweet = preprocess(tweets_df.iloc[i][1])
          cleaned_tweets.append(cleaned_tweet)
          if(i % 10000 == 0):
              print(str(i) + " tweets processed")
      print("Finished processing of tweets at: " + str(datetime.now()))
      return cleaned_tweets


  cleaned_data = process_data("tweets.csv")
  ```

  And here is the relevant output:

  ```
  Total tweets: 216041
  Beginning processing of tweets at: 2017-05-16 13:45:47.183113
  Finished processing of tweets at: 2017-05-16 13:47:01.436338
  ```

  Going by the timestamps, it takes a little over a minute (about 74 seconds) to process the tweets. Although that is a relatively small timeframe for the current dataset, I would like to improve it further, especially for when I use a much bigger dataset. Can the steps/code in the `preprocess(raw_text)` method be improved in order to achieve faster execution?
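One direction worth trying, offered as a minimal sketch rather than a measured fix: the question's `preprocess` rebuilds the stopword set on every call, so hoisting that work (and the regex compilation) out of the per-tweet path is a plausible first step. The tiny `STOPWORDS` set below is a hypothetical stand-in for `set(stopwords.words("english"))` so that the example runs without NLTK data, and `process_tweets` is an illustrative name that takes any iterable of strings instead of a DataFrame.

```python
import re

# Hypothetical stand-in for set(stopwords.words("english")), built once at import time.
STOPWORDS = frozenset({"a", "an", "the", "is", "are", "to", "of", "and", "in", "this"})

# Compiled once; "+" collapses each run of non-letters into a single match.
NON_LETTERS = re.compile(r"[^a-zA-Z]+")


def preprocess(raw_text, stopword_set=STOPWORDS):
    """Keep letters only, lower-case, drop stopwords, and re-join with spaces."""
    words = NON_LETTERS.sub(" ", raw_text).lower().split()
    return " ".join(w for w in words if w not in stopword_set)


def process_tweets(tweets, stopword_set=STOPWORDS):
    """Clean every tweet in an iterable of strings."""
    return [preprocess(tweet, stopword_set) for tweet in tweets]


sample = ["This is the FIRST tweet!!! #nlp", "Cleaning 200,000 tweets in a loop..."]
cleaned = process_tweets(sample)
print(cleaned)
```

With a DataFrame, the same function could be fed the text column directly (for example `process_tweets(tweets_df[1])`), which also avoids the row-by-row `iloc` lookups. Whether either change matters on the real data would need timing.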
- debug minor 112d ago
  C gets_s() implementation

  I'm in a class that is using C, and my instructor has unfortunately used `gets()` in sample code. As this is obviously a heinous oversight, likely to cause undefined behavior and other various issues (only a little sarcasm), I decided to implement `gets_s()`, because it was a fun exercise and sometimes it's just not worth it to do full error checking with `fgets()` and you just want to truncate unexpectedly long lines.

  I'm not concerned whether this fully implements `gets_s()` as specified in the C11 standard; this is just supposed to be a drop-in replacement for `gets()` that doesn't overrun your buffer. However, what is very important is that this function actually does what it advertises: it's safe and doesn't overrun the buffer.

  This is my first time working in C (I usually use Java or Kotlin), and I appreciate all tips, though I'd like at least some mention of the safety of this code, and I am also interested in portability (to current compilers).

  `gets_s.h`

  ```
  #include <stdio.h>
  #include <string.h>

  #define GETS_S_OK 0
  #define GETS_S_ERROR 1
  #define GETS_S_OVERRUN 2

  static inline int gets_s( char str[], int n )
  {
      char *str_end, *fgets_return;
      int temp;

      fgets_return = fgets( str, n, stdin );

      /* If fgets fails, it returns NULL. This includes the case where stdin is exhausted. */
      if ( fgets_return == NULL )
      {
          str[0] = '\0';
          return GETS_S_ERROR;
      }

      str_end = str + strlen(str) - 1;
      if ( *str_end == '\n' )
      {
          *str_end = '\0';
          return GETS_S_OK;
      }

      temp = fgetc( stdin );
      if (temp == EOF || temp == '\n')
          return GETS_S_OK;

      do
          temp = fgetc( stdin );
      while ( temp != EOF && temp != '\n' );

      return GETS_S_OVERRUN;
  }
  ```

  and a small test file: `gets_s.c`

  ```
  #include "gets_s.h"
  #include <stdio.h>

  int main()
  {
      char buffer[10];
      int gets_s_return;

      printf("Enter up to %d characters safely.\n", sizeof(buffer) - 1);
      gets_s_return = gets_s( buffer, sizeof(buffer) );
      printf("buffer = %s", buffer);
      printf("gets_s return = %d", get
  ```
- pattern minor 112d ago
  Minimal substring with all characters contained in string

  I was given the following task in an interview:

  Given a string, find the shortest substring in it that contains all of the different characters contained in the original string.

  Here is my solution in Emacs Lisp:

  ```
  (defun add-to-hash (c ht)
    "Add an instance of character C to hash table HT"
    (puthash c (1+ (gethash c ht 0)) ht))

  (defun remove-from-hash (c ht &optional try)
    "Remove an instance of character C from hash table HT
  Return nil if no instances of C remain in HT"
    (let ((old (gethash c ht 0)))
      (if (and try ( 1 (puthash c (1- (gethash c ht 0)) ht)) (remhash c ht) t))))

  (defun find-min-substring (sstr)
    "Find minimal substring that contains all the characters in a given string"
    ;; get all characters
    (let* ((all-chars (make-hash-table))
           (slen (length sstr))
           (fcnt (progn
                   (mapc (lambda (c) (add-to-hash c all-chars)) sstr)
                   (hash-table-count all-chars)))
           (beg 0)
           (end fcnt)
           (res sstr))
      (if (= end slen)
          res
        (let* ((cand (substring sstr beg end))
               (cand-chars (make-hash-table))
               (ccnt (progn
                       (mapc (lambda (c) (add-to-hash c cand-chars)) cand)
                       (hash-table-count cand-chars))))
          ;; find first candidate, that is a substring with all the characters
          (while (< ccnt fcnt)
            (add-to-hash (aref sstr end) cand-chars)
            (setq end (1+ end))
            (setq ccnt (hash-table-count cand-chars)))
          (setq cand (substring sstr beg end))
          ;; shorten it as much as possible
          (while (remove-from-hash (aref sstr beg) cand-chars t)
            (setq beg (1+ beg)))
          (setq cand (substring sstr beg end))
          (setq res cand)
          ;; check other variants
          (while (< end slen)
            ;; advance both ends
            (add-to-hash (aref sstr end) cand-chars)
            (setq end (1+ end))
            (remove-from-hash (aref sstr beg) cand-chars)
  ```
- pattern minor 112d ago
  Calculate the sum at a level of a binary tree represented in a String

  Fair preface: This is an interview question I would like to improve my knowledge of. I got some rest, and re-solved the problem with a fresh/unpanicked mind.

  Given input of the type `(10(5(3)(12))(14(11)(17)))` which would represent the following tree

  ```
  n=0          10
  n=1      5        14
  n=2    3   12   11   17
  ```

  My task is to find the summation of values at a particular tier, like \$5+14=19\$ which is the sum for \$n=1\$, or \$3+12+11+17=43\$ the sum for \$n=2\$.

  Considering this is a binary tree, a recursive function seems appropriate. My main utility functions are:

  - `stripFirstLastParen` – strips the first and last paren
  - `getCurrentVal` – retrieves the value of the current node
  - `getChildren` – retrieves the left and right nodes

  ```
  var input = "(10(5(3)(12))(14(11)(17)))";

  function stripFirstLastParen(input){
    if(input[0] !== "(" || input[input.length - 1] !== ")"){
      console.error("unbalanced parens")
    }
    else{
      input = input.substr(1);
      input = input.substring(0, input.length - 1);
    }
    return input;
  }

  function getCurrentVal(input){
    var val = "";
    while(input[0] !== "(" && input[0] !== ")" && input.length){
      val += input[0];
      input = input.substr(1);
    }
    return {
      val,
      input
    }
  }

  function getChildren(input){
    var val = "";
    if(input.length == 0){
      return {
        left: "",
        right: ""
      }
    }
    if(input[0] !== "("){
      console.error("no open paren at start");
    }
    val += input[0];
    input = input.substr(1);
    var parenStack = 1;
    while(parenStack > 0){
      if(input[0] == ")"){
        parenStack--;
      }
      else if(input[0] == "("){
        parenStack++;
      }
      val += input[0];
      input = input.substr(1);
    }
    return {
      left: val,
      right: input
    }
  }

  function getValueForLevel(input, k){
    var totalValue = 0;
    input = stripFirstLastParen(input);
    var currentValue = getCurrentVal(input);
    var children = getChildren(currentValue.input);
    if(k > 0){
      if(children.left.leng
  ```
- pattern major 112d ago
  Join List with Separator

  I was implementing something similar to Python's join function, where

  ```
  join([a1, a2, ..., aN], separator :: String)
  ```

  returns

  ```
  str(a1) + separator + str(a2) + separator + ... + str(aN)
  ```

  e.g.,

  ```
  join([1, 2, 3], '+') == '1+2+3'
  ```

  What is a good pattern for this? The issue is that the separator must be added after every element except the last one.

  ```
  def join(l, sep):
      out_str = ''
      for i, el in enumerate(l):
          out_str += '{}{}'.format(el, sep)
      return out_str[:-len(sep)]
  ```

  I'm quite happy with this, but is there a canonical approach?
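For comparison, here is a minimal sketch of two common ways to avoid the trailing-separator problem. The names `join_items` and `join_manual` are illustrative only. The first delegates to the built-in `str.join`, and the second emits the first element alone and then prefixes every later element with the separator, so nothing needs slicing off afterwards. Note that slicing with `[:-len(sep)]` returns an empty string when `sep` is `''`, because `-0` is just `0`; neither version below has that edge case.

```python
def join_items(items, sep):
    """Join the string forms of items using the built-in str.join."""
    return sep.join(str(item) for item in items)


def join_manual(items, sep):
    """Same result without str.join: the separator goes before every element but the first."""
    iterator = iter(items)
    try:
        parts = [str(next(iterator))]
    except StopIteration:
        return ""  # empty input yields an empty string
    for item in iterator:
        parts.append(sep)
        parts.append(str(item))
    return "".join(parts)


print(join_items([1, 2, 3], "+"))
print(join_manual([1, 2, 3], "+"))
```

Both functions accept any iterable, including generators, since neither relies on knowing the length up front.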
- pattern minor 112d ago
  Weighted Uniform Strings

  Problem

  Adapted from this HackerRank problem. Instead of printing `YES` or `NO`, I just want to return the `Set` of all possible weights for an input `String`. I found this example to be illustrative

  Implementation

  Some of my thoughts are in the implementation as comments

  ```
  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;

  public class WeightedUniformStrings {

      public static Set getWeights(String s) {
          // I thought about using streams here, but I'm not using java 8
          Set weights = new HashSet<>();
          Map consecutiveCharacterCounts = WeightedUniformStrings.getConsecutiveCharactersCounts(s);
          for (Map.Entry entry : consecutiveCharacterCounts.entrySet()) {
              if (entry.getValue() != null && entry.getKey() != null) {
                  long weight = WeightedUniformStrings.calculateWeight(entry.getKey());
                  for (long i = 0; i getConsecutiveCharactersCounts(String s) {
          if (s.isEmpty()) {
              return new HashMap<>();
          }
          Map consecutiveCharacterCounts = new HashMap<>();
          char[] chars = s.toCharArray();
          // I'm not a big fan of this initialization + for loop - but I haven't thought of a better alternative implementation
          char startingCharacter = chars[0];
          long consecutiveCharacterCount = 1;
          consecutiveCharacterCounts.put(startingCharacter, consecutiveCharacterCount);
          for (int i = 1; i characterCount) {
                      consecutiveCharacterCounts.put(startingCharacter, consecutiveCharacterCount);
                  }
              }
              // Doing this logical check twice feels weird
              if (currentCharacter != startingCharacter) {
                  startingCharacter = currentCharacter;
                  consecutiveCharacterCount = 1;
                  Long characterCount = consecutiveCharacterCounts.get(startingCharacter);
                  if (characterCount == null || consecutiveCharacterCount > characterCount) {
                      consecutiveCharacterCounts.put(startingCharacter, consecutiveCharacterCount);
                  }
              }
          }
          return con
  ```
- pattern minor 112d ago
  Replace ' ' Spaces in a URL with a '%20' - Coding Challenge

  This was a small coding challenge proposed at my school. The problem is:

  Given a string (or URL), replace the white-spaces inside of the string with `%20`, but any spaces on the outside should not be included.

  For example, input `" Mr John Smith "` would return `"Mr%20John%20Smith"`.

  I have completed this challenge successfully using the code below. My question is, is there any way to improve the efficiency? I believe the complexity is currently \$O(2n) = O(n)\$ given the 2 for loops. I do not want to use any libraries or functions like `str.replace()`. But I'm assuming there is a better way than trimming and then counting the whitespace.

  ```
  public class URL {

      /**
       * @description URLify ~ A small method that makes a string with spaces URL Friendly!
       * @param str
       * @param length
       * @return String
       */
      public static String URLify(String str) {
          str = str.trim();
          int length = str.length();
          int trueL = length;
          if(str.contains(" ")) {
              for(int i = 0; i < length; i++) {
                  if(str.charAt(i) == ' ') {
                      trueL = trueL + 2;
                  }
              }
              char[] oldArr = str.toCharArray();
              char[] newArr = new char[trueL];
              int x = 0;
              for(int i = 0; i < length; i++) {
                  if(oldArr[i] == ' ') {
                      newArr[x] = '%';
                      newArr[x+1] = '2';
                      newArr[x+2] = '0';
                      x += 3;
                  }
                  else {
                      newArr[x] = oldArr[i];
                      x++;
                  }
              }
              str = new String(newArr, 0, trueL);
          }
          return str;
      }

      public static void main(String[] args) {
          String str = " https://google.com/ testing .pdf ";
          str = URLify(str);
          System.out.print(str);
      }
  }
  ```

  Note: I am looking for any criticism. I would really like to improve my skills in ge
- pattern minor 112d ago
  Primitive solution of squeeze

  As some of you might know, I'm working my way through The C Programming Language as recommended by "The Definitive C Book Guide and List" on StackOverflow. I just finished my version of `squeeze(s1, s2)`, a function to remove all occurrences of s2 in s1, and was hoping to get some feedback on my solution.

  This exercise is part of the second chapter (ex 2.4), so we haven't really gone through pointers and de-referencing yet. Basically, all the experience I have with the language is with using basic and primitive data types, plus the previous feedback I've received here!

  All sorts of feedback would be greatly appreciated: syntactical changes, algorithmic shortcomings etc. Just keep in mind that I only recently started out with the language!

  Here is the code:

  ```
  /* Exercise 2.4

     squeeze (s1, s2): Remove all characters of s2 in s1.

     INPUT : s1.length >= s2 > 0.
     OUTPUT: The rest of s1 after deleting all occurrences of letters in s2. */

  #include <stdio.h>

  void squeeze (char s1[], const char *s2);
  /* Returns (by-ref) the resulting string s1 after removing all occurrences of s2. */
  int toUpper(char c);
  /* Returns the numerical representation of a hexadecimal digit. */
  int contains(char toCheck, const char *toRemove);
  /* Function to see if toRemove contains toCheck. */

  int main ()
  {
      char s1[] = "I am a test.\0";
      const char *s2 = "AE\0";
      printf("Before manipulation: %s\n", s1);
      squeeze(s1, s2);
      printf("After manipulation: %s", s1);
  }

  /* Returns the (by-ref) resulting string s1 after removing all occurrences of letters in s2. */
  void squeeze (char s1[], const char *s2)
  {
      int index, left_shift;
      for (index = 0, left_shift = 0; s1[index] != '\0'; index++) {
          if (contains(s1[index], s2))
              continue;
          s1[left_shift] = s1[index];
          left_shift++;
      }
      s1[left_shift] = '\0';
  }

  /* Returns the upper-case representation of char c. */
  int toUpper (char c)
  {
      if (c >= 'a' && c
  ```
- pattern minor 112d ago
  Length of longest palindrome subsequence

  I had an interview where I was asked to solve this question:

  Given a string str, find the maximum length of a palindrome subsequence.

  Example:

  > if str = 'abcda': maximum_length = 3 ('aca')

  How can I improve this code?

  ```
  def maximum_length_palindrome_subsequence(a):
      a = list(a)
      n = len(a)
      table = [[0 for i in xrange(n)] for j in xrange(n)]
      for i in xrange(n):
          table[i][i] = 1
      for gap in xrange(1,n):
          for i in xrange(n-1):
              h = i+gap
              if h>=n:
                  continue
              if a[i] == a[h]:
                  if i+1==h:
                      table[i][h] = 2
                  else:
                      table[i][h] = table[i+1][h-1] + 2
              else:
                  if h<n:
                      table[i][h] = max(table[i+1][h], table[i][h-1])
      return table[0][n-1]

  print maximum_length_palindrome_subsequence('abcbda')
  ```
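As a point of comparison, here is a hedged sketch of the same interval dynamic programme with tighter loop bounds, so the `h >= n` and `h < n` guards are no longer needed. It is written for Python 3 (the question's code is Python 2), and the function name `longest_palindromic_subsequence` is illustrative only. The adjacent-pair special case disappears because the cell below the diagonal is left at 0, so `table[i + 1][j - 1] + 2` already gives 2 when `j == i + 1`.

```python
def longest_palindromic_subsequence(s):
    """Return the length of the longest palindromic subsequence of s."""
    n = len(s)
    if n == 0:
        return 0

    # table[i][j] is the answer for the slice s[i..j], both ends inclusive.
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        table[i][i] = 1  # a single character is a palindrome of length 1

    for gap in range(1, n):
        # i stops early enough that j = i + gap always stays inside the string.
        for i in range(n - gap):
            j = i + gap
            if s[i] == s[j]:
                # For gap == 1 this reads table[i + 1][i], which is still 0.
                table[i][j] = table[i + 1][j - 1] + 2
            else:
                table[i][j] = max(table[i + 1][j], table[i][j - 1])

    return table[0][n - 1]


print(longest_palindromic_subsequence("abcda"))
print(longest_palindromic_subsequence("abcbda"))
```

The time and memory cost is still quadratic in the length of the string. This version only removes redundant checks and handles the empty string explicitly.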
- pattern minor 112d ago
  For each pair of strings, eliminate the characters they have in common

  The code below takes three inputs. The first line of input is the number of testcases for which the program will run, where (1<=testcases<=10) and `i` is an `int`. For each testcase, the second and third inputs are each a `String`, where (1<=Stringlength<=10^5).

  Output

  For the output we need to compare both strings and remove the characters they have in common from both of them. The output depends on which string reaches length 0 first. If neither has length 0 after completion, it is a draw.

  The time limit is 1.0 sec for each input, but my time for strings of length 10^5 is 2.0 sec. Is there a way to make it faster?

  ```
  import java.io.*;
  import java.util.*;

  public class MyProg{
      public static void main(String[] args) throws IOException {
          BufferedReader br=new BufferedReader(new InputStreamReader(System.in));
          System.out.println("Enter the number of testcases");
          String ts=br.readLine();
          int testcase=Integer.parseInt(ts);
          for(int i=0;i s1 = new ArrayList();
              List s2 = new ArrayList();
              for(char c : sa.toCharArray())
                  s1.add(c);
              for(char c : sb.toCharArray())
                  s2.add(c);
              for(char c:s2) {
                  s1.remove((Character)c);
              }
              for(char c:sa.toCharArray()) {
                  s2.remove((Character)c);
              }
              if(s1.size()==0 && s2.size()>0)
                  System.out.println("You lose some.");
              else if(s1.size()>0 && s2.size()==0)
                  System.out.println("You win some.");
              else
                  System.out.println("You draw some.");
          }
      }}
  ```
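A likely source of the slowdown is that `List.remove(Object)` scans the list linearly, so removing one string's characters from the other costs on the order of the product of the two lengths. Below is a minimal sketch of a counting approach, written in Python for brevity; the function name `outcome` is hypothetical, and the three result strings are taken from the question's code. It assumes, as the list-based code does, that each character in one string cancels at most one matching character in the other.

```python
from collections import Counter


def outcome(sa, sb):
    """Cancel common characters pairwise and report which string, if any, is used up."""
    count_a = Counter(sa)
    count_b = Counter(sb)

    # Counter subtraction keeps only positive counts, i.e. the characters
    # left over in each string after pairwise cancellation.
    left_a = count_a - count_b
    left_b = count_b - count_a

    if not left_a and left_b:
        return "You lose some."
    if left_a and not left_b:
        return "You win some."
    return "You draw some."


print(outcome("abc", "abcd"))
print(outcome("abcd", "abc"))
print(outcome("abc", "abc"))
```

Counting is linear in the combined length of the two strings. The same idea carries over to Java with an `int[]` of counts indexed by character, which avoids boxing each `char` into a `Character`.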