
Using Python 3 for Text Processing

Created on: Sep 28, 2017 9:21 AM by Arshpreet singh - Last Modified:  Sep 28, 2017 10:29 AM by Arshpreet singh

Python rewards programming like a hacker. If you keep reference counting, type checking, data manipulation, stacks, and variable management in mind while you write, and cut down on unnecessary lists and explicit "for" loops, your code will look cleaner, use fewer system resources, and often run faster.

 

Slower than C (and C++ and Fortran): Yes, Python is slower than C, but you need to ask yourself what "fast" means and what you actually want to do. There are several ways to write Fibonacci in Python. The most popular uses a 'for' loop, mostly because programmers coming from a C background use lots of for loops for iteration. Python has for loops as well, but if you can replace one with the internal loops provided by Python's data structures, or by array libraries such as NumPy, you win most of the time. Now for some Python tricks that are super cool if your job mostly involves manipulating, filtering, extracting, and parsing data. Python has many built-in text processing methods:

 

 

>>> m = ['i am amazing in all the ways I should have']

>>> m[0]

'i am amazing in all the ways I should have'
>>> m[0].split()

['i', 'am', 'amazing', 'in', 'all', 'the', 'ways', 'I', 'should', 'have']

>>> n = m[0].split()

>>> n[2:]

['amazing', 'in', 'all', 'the', 'ways', 'I', 'should', 'have']

>>> n[0:2]

['i', 'am']

>>> n[-2]

'should'
>>> n[:-2]

['i', 'am', 'amazing', 'in', 'all', 'the', 'ways', 'I']

>>> n[::-2]

['have', 'I', 'the', 'in', 'am']
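The same split and slice operations can be combined with join to rebuild text. The snippet below is a minimal sketch of that idea; the sentence and the variable names are illustrative and not part of the session above.

```python
# Illustrative sentence; any whitespace-separated text works the same way.
sentence = "i am amazing in all the ways I should have"

words = sentence.split()                # split on runs of whitespace
first_two = " ".join(words[:2])         # rejoin the first two words
reversed_text = " ".join(words[::-1])   # reverse the word order with a slice
last_word = words[-1].upper()           # a negative index counts from the end

print(first_two)
print(reversed_text)
print(last_word)
```

Every step here is a single method call or slice, so no explicit loop is needed.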

 

 

Those are all list operations used for string manipulation, and there is not a for loop in sight.

Interesting portions of the collections module: Now let's talk about collections. Counter is my personal favorite. It is handy when you have to go through big lists and see how often each item actually occurs:

 

>>> from collections import Counter

>>> Counter(range(10))

Counter({0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})

>>> just_list_again = Counter(range(10))

>>> just_list_again_is_dict = just_list_again

>>> just_list_again_is_dict[1]

1
>>> just_list_again_is_dict[2]

1
>>> just_list_again_is_dict[3]

1
>>> just_list_again_is_dict['3']

0
Some other Counter methods:
>>> Counter('abraakadabraaaaa')

Counter({'a': 10, 'r': 2, 'b': 2, 'k': 1, 'd': 1})

>>> c1=Counter('abraakadabraaaaa')

>>> c1.most_common(4)

[('a', 10), ('r', 2), ('b', 2), ('k', 1)]

>>> c1['b']

2
>>> c1['b'] # works like a dictionary
2
>>> c1['k'] # works like a dictionary
1
>>> type(c1)

<class 'collections.Counter'>
>>> c1['b'] = 20
>>> c1.most_common(4)

[('b', 20), ('a', 10), ('r', 2), ('k', 1)]

>>> c1['b'] += 20
>>> c1.most_common(4)

[('b', 40), ('a', 10), ('r', 2), ('k', 1)]
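For text processing, the usual pattern is to count words rather than characters. The following is a small sketch under the assumption that the text is already in memory; the sample text and names are made up for illustration.

```python
from collections import Counter

# Illustrative text; in practice this might be read from a file.
text = "the cat and the dog and the bird"

# Counter accepts any iterable, so the list of words can be passed directly.
word_counts = Counter(text.lower().split())

# most_common(n) returns the n highest counts as (item, count) pairs.
top_two = word_counts.most_common(2)
print(top_two)

# A missing key returns 0 instead of raising KeyError.
print(word_counts["elephant"])
```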

 

 

Arithmetic and unary operations:

 


>>> from collections import Counter

>>> c1=Counter('hello hihi hoo')

>>> +c1

Counter({'h': 4, 'o': 3, ' ': 2, 'i': 2, 'l': 2, 'e': 1})

>>> -c1

Counter()

>>> c1['x']

0
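Unary minus returned an empty Counter above because Counter arithmetic keeps only positive counts: negating every count leaves nothing positive. The sketch below shows the binary operators alongside the unary ones; the item names are hypothetical.

```python
from collections import Counter

# Hypothetical inventory counts, used only to show the operators.
stock = Counter(apples=5, pears=2)
sold = Counter(apples=3, pears=4)

combined = stock + sold    # add counts key by key
remaining = stock - sold   # subtract, dropping results that are not positive
common = stock & sold      # minimum of each count
largest = stock | sold     # maximum of each count

# Unary plus drops zero and negative counts; unary minus negates the counts
# and then keeps only what is positive.
cleaned = +Counter(a=0, b=-1, c=2)
negated = -Counter(debt=-2, credit=3)

print(combined, remaining, common, largest, cleaned, negated)
```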

 

 

Counter is like a dictionary, but it also keeps the count of every item you are looking at, so you can plot the results on graphs.

OrderedDict: it remembers the order in which keys were inserted, so you can arrange your chunks of data in a meaningful order.

 

 

>>> from collections import OrderedDict
>>> d = {'banana': 3, 'apple':4, 'pear': 1, 'orange': 2}
>>> new_d = OrderedDict(sorted(d.items()))
>>> new_d
OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])
>>> for key in new_d:
...     print(key, new_d[key])
...
apple 4
banana 3
orange 2
pear 1
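Sorting by key is only one option. As a sketch, the same dictionary can be ordered by its values, and an existing key can be repositioned with move_to_end; the variable names here are illustrative.

```python
from collections import OrderedDict

d = {'banana': 3, 'apple': 4, 'pear': 1, 'orange': 2}

# Sort by count (the value) instead of by name, largest first.
by_count = OrderedDict(sorted(d.items(), key=lambda item: item[1], reverse=True))
order_by_count = list(by_count)
print(order_by_count)

# move_to_end reorders an existing key without rebuilding the mapping;
# last=False moves it to the front.
by_count.move_to_end('pear', last=False)
order_after_move = list(by_count)
print(order_after_move)
```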

 

 

Namedtuple: Think of saving each line of a CSV file into a list of lines. You need to keep memory use down, but you also want to read each line by field name, the way you would with a dictionary. A namedtuple gives you both: fields you access by name, stored as compactly as a plain tuple. That comes up constantly when you fetch lines from Excel or CSV documents in a data-processing environment.

 

 

from collections import namedtuple

# The primitive approach
lat_lng = (37.78, -122.40)
print('The latitude is %f' % lat_lng[0])
print('The longitude is %f' % lat_lng[1])

# The glorious namedtuple
LatLng = namedtuple('LatLng', ['latitude', 'longitude'])
lat_lng = LatLng(37.78, -122.40)
print('The latitude is %f' % lat_lng.latitude)
print('The longitude is %f' % lat_lng.longitude)
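The CSV use case described above might look like the following sketch. It assumes the first line of the file is a header whose column names are valid Python identifiers; the in-memory data and the name Row are invented for illustration.

```python
import csv
import io
from collections import namedtuple

# Stand-in for a CSV file on disk; the columns and values are made up.
csv_text = "city,latitude,longitude\nSan Francisco,37.78,-122.40\nDelhi,28.61,77.21\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
Row = namedtuple("Row", header)   # field names come from the header line

# _make builds a Row from any iterable, here one parsed CSV line.
rows = [Row._make(line) for line in reader]
for row in rows:
    print(row.city, float(row.latitude), float(row.longitude))

# _asdict() gives a dictionary view of one row when that is more convenient.
first_as_dict = rows[0]._asdict()
```

The csv module returns every field as a string, so numeric columns still need an explicit conversion such as float().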

 

 

ChainMap: It is a container of containers, and yes, that's really true. It groups several dictionaries into a single view and searches them in order. You need Python 3.3 or later to try this code.

 


>>> from collections import ChainMap

>>> a1 = {'m':2,'n':20,'r':490}

>>> a2 = {'m':34,'n':32,'z':90}

>>> chain = ChainMap(a1,a2)

>>> chain

ChainMap({'n': 20, 'm': 2, 'r': 490}, {'n': 32, 'm': 34, 'z': 90})

>>> chain['n']

20

 

>>> new_chain = ChainMap({'a':22,'n':27},chain)

>>> new_chain['a']

22
>>> new_chain['n']

27
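A common use for this lookup order is layering user settings over defaults. The sketch below assumes two hypothetical dictionaries, defaults and overrides, and shows that writes only touch the first mapping.

```python
from collections import ChainMap

# Hypothetical configuration layers, used only for illustration.
defaults = {'color': 'blue', 'user': 'guest'}
overrides = {'user': 'admin'}

settings = ChainMap(overrides, defaults)
print(settings['user'])    # found in overrides
print(settings['color'])   # falls back to defaults

# Writes go to the first mapping only; defaults stays untouched.
settings['color'] = 'red'

# new_child() puts a fresh mapping in front of the existing ones.
scoped = settings.new_child({'user': 'temp'})
print(scoped['user'], settings['user'])
```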

 

 

Comprehensions: You can also write comprehensions for dictionaries and sets, not just lists.

 

 

>>> m = {'a': 1, 'b': 2, 'c': 3, 'd': 4}

>>> m

{'d': 4, 'a': 1, 'b': 2, 'c': 3}

>>> {v: k for k, v in m.items()}

{1: 'a', 2: 'b', 3: 'c', 4: 'd'}
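As a short sketch of the set form and of filtering with a condition, the snippet below works on a list of words; the sample text is illustrative.

```python
words = "the cat and the dog and the bird".split()

# Set comprehension: the distinct word lengths.
lengths = {len(w) for w in words}

# Dict comprehension: each distinct word mapped to its length.
word_lengths = {w: len(w) for w in words}

# A condition filters items, here keeping only words longer than three letters.
long_words = {w for w in words if len(w) > 3}

print(lengths, word_lengths, long_words)
```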


 

 

startswith and endswith methods for string processing: All things have a start and an end, and we often need to test how a string starts or ends. The startswith and endswith methods do exactly that.

 

 

phrase = "cat, dog and bird"
# See if the phrase starts with these strings.
if phrase.startswith("cat"):
    print(True)

if phrase.startswith("cat, dog"):
    print(True)

# It does not start with this string.
if not phrase.startswith("elephant"):
    print(False)

Output

True
True
False
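Both methods also accept a tuple of strings, which tests several alternatives in one call. The sketch below filters a hypothetical list of file names; the names and the prefixes are made up.

```python
# Hypothetical file names, used only for illustration.
filenames = ["report.csv", "notes.txt", "data.CSV", "tmp_cache.bin", "summary.xlsx"]

# endswith with a tuple matches any of the listed suffixes;
# lower() makes the test case-insensitive.
tabular = [name for name in filenames if name.lower().endswith((".csv", ".xlsx"))]

# startswith with a tuple works the same way for prefixes.
not_temporary = [name for name in filenames if not name.startswith(("tmp_", "~"))]

print(tabular)
print(not_temporary)
```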

 

 

map and imap for iteration: In Python 3, map returns a lazy iterator that produces each result only when it is asked for, which saves a lot of memory. In Python 2, map builds the whole result list up front, so there you can use the lazy version from the itertools module, where it is named imap (from itertools import imap).

 

 

>>> m = lambda x: x*x
>>> print(m)
<function <lambda> at 0x7f61acf9a9b0>
>>> print(m(3))
9
# A lambda returns the value of its expression,
# so it can be passed anywhere a function is expected.
>>> my_sequence = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
>>> print(list(map(m, my_sequence)))
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400]

# So the square is applied to each element without writing any loop or if.
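The laziness of map in Python 3 can be seen directly. This is a minimal sketch; the function name square is illustrative.

```python
def square(x):
    return x * x

squares = map(square, range(1, 6))   # nothing is computed yet

first = next(squares)        # computes only the first value
rest = list(squares)         # consumes the remaining values
exhausted = list(squares)    # a map object can be read only once

print(first, rest, exhausted)
```

Wrap the call in list() when the results are needed more than once.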